**Esther Guerra · Mariëlle Stoelinga (Eds.)**

# **Fundamental Approaches to Software Engineering**

**24th International Conference, FASE 2021, Held as Part of the European Joint Conferences on Theory and Practice of Software, ETAPS 2021, Luxembourg City, Luxembourg, March 27 – April 1, 2021, Proceedings**

# Lecture Notes in Computer Science 12649

Founding Editors

Gerhard Goos, Germany; Juris Hartmanis, USA

# Editorial Board Members

Elisa Bertino, USA; Wen Gao, China; Bernhard Steffen, Germany; Gerhard Woeginger, Germany; Moti Yung, USA

# Advanced Research in Computing and Software Science Subline of Lecture Notes in Computer Science

Subline Series Editors

Giorgio Ausiello, University of Rome 'La Sapienza', Italy; Vladimiro Sassone, University of Southampton, UK

Subline Advisory Board

Susanne Albers, TU Munich, Germany; Benjamin C. Pierce, University of Pennsylvania, USA; Bernhard Steffen, University of Dortmund, Germany; Deng Xiaotie, Peking University, Beijing, China; Jeannette M. Wing, Microsoft Research, Redmond, WA, USA

More information about this subseries at http://www.springer.com/series/7407


Editors

Esther Guerra, Universidad Autónoma de Madrid, Madrid, Spain

Mariëlle Stoelinga, University of Twente, Enschede, The Netherlands, and Radboud University Nijmegen, The Netherlands

ISSN 0302-9743 · ISSN 1611-3349 (electronic)

Lecture Notes in Computer Science

ISBN 978-3-030-71499-4 · ISBN 978-3-030-71500-7 (eBook)

https://doi.org/10.1007/978-3-030-71500-7

LNCS Sublibrary: SL1 – Theoretical Computer Science and General Issues

© The Editor(s) (if applicable) and The Author(s) 2021. This book is an open access publication.

Open Access This book is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this book are included in the book's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the book's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

# ETAPS Foreword

Welcome to the 24th ETAPS! ETAPS 2021 was originally planned to take place in Luxembourg in its beautiful capital Luxembourg City. Because of the Covid-19 pandemic, this was changed to an online event.

ETAPS 2021 was the 24th instance of the European Joint Conferences on Theory and Practice of Software. ETAPS is an annual federated conference established in 1998, and consists of four conferences: ESOP, FASE, FoSSaCS, and TACAS. Each conference has its own Program Committee (PC) and its own Steering Committee (SC). The conferences cover various aspects of software systems, ranging from theoretical computer science to foundations of programming languages, analysis tools, and formal approaches to software engineering. Organising these conferences in a coherent, highly synchronised conference programme enables researchers to participate in an exciting event, having the possibility to meet many colleagues working in different directions in the field, and to easily attend talks of different conferences. On the weekend before the main conference, numerous satellite workshops take place that attract many researchers from all over the globe.

ETAPS 2021 received 260 submissions in total, 115 of which were accepted, yielding an overall acceptance rate of 44.2%. I thank all the authors for their interest in ETAPS, all the reviewers for their reviewing efforts, the PC members for their contributions, and in particular the PC (co-)chairs for their hard work in running this entire intensive process. Last but not least, my congratulations to all authors of the accepted papers!

ETAPS 2021 featured the unifying invited speakers Scott Smolka (Stony Brook University) and Jane Hillston (University of Edinburgh) and the conference-specific invited speakers Işil Dillig (University of Texas at Austin) for ESOP and Willem Visser (Stellenbosch University) for FASE. Invited tutorials were provided by Erika Ábrahám (RWTH Aachen University) on analysis of hybrid systems and Madhusudan Parthasarathy (University of Illinois at Urbana-Champaign) on combining machine learning and formal methods.

ETAPS 2021 was originally supposed to take place in Luxembourg City, Luxembourg organized by the SnT - Interdisciplinary Centre for Security, Reliability and Trust, University of Luxembourg. University of Luxembourg was founded in 2003. The university is one of the best and most international young universities with 6,700 students from 129 countries and 1,331 academics from all over the globe. The local organisation team consisted of Peter Y.A. Ryan (general chair), Peter B. Roenne (organisation chair), Joaquin Garcia-Alfaro (workshop chair), Magali Martin (event manager), David Mestel (publicity chair), and Alfredo Rial (local proceedings chair).

ETAPS 2021 was further supported by the following associations and societies: ETAPS e.V., EATCS (European Association for Theoretical Computer Science), EAPLS (European Association for Programming Languages and Systems), and EASST (European Association of Software Science and Technology).

The ETAPS Steering Committee consists of an Executive Board, and representatives of the individual ETAPS conferences, as well as representatives of EATCS, EAPLS, and EASST. The Executive Board consists of Holger Hermanns (Saarbrücken), Marieke Huisman (Twente, chair), Jan Kofron (Prague), Barbara König (Duisburg), Gerald Lüttgen (Bamberg), Caterina Urban (INRIA), Tarmo Uustalu (Reykjavik and Tallinn), and Lenore Zuck (Chicago).

Other members of the steering committee are: Patricia Bouyer (Paris), Einar Broch Johnsen (Oslo), Dana Fisman (Be'er Sheva), Jan-Friso Groote (Eindhoven), Esther Guerra (Madrid), Reiko Heckel (Leicester), Joost-Pieter Katoen (Aachen and Twente), Stefan Kiefer (Oxford), Fabrice Kordon (Paris), Jan Křetínský (Munich), Kim G. Larsen (Aalborg), Tiziana Margaria (Limerick), Andrew M. Pitts (Cambridge), Grigore Roșu (Illinois), Peter Ryan (Luxembourg), Don Sannella (Edinburgh), Lutz Schröder (Erlangen), Ilya Sergey (Singapore), Mariëlle Stoelinga (Twente), Gabriele Taentzer (Marburg), Christine Tasson (Paris), Peter Thiemann (Freiburg), Jan Vitek (Prague), Anton Wijs (Eindhoven), Manuel Wimmer (Linz), and Nobuko Yoshida (London).

I'd like to take this opportunity to thank all the authors, attendees, organizers of the satellite workshops, and Springer-Verlag GmbH for their support. I hope you all enjoyed ETAPS 2021.

Finally, a big thanks to Peter, Peter, Magali and their local organisation team for all their enormous efforts to make ETAPS a fantastic online event. I hope there will be a next opportunity to host ETAPS in Luxembourg.

February 2021

Marieke Huisman, ETAPS SC Chair, ETAPS e.V. President

# Preface

This volume contains the papers presented at FASE 2021, the 24th International Conference on Fundamental Approaches to Software Engineering. FASE 2021 was organized as part of the annual European Joint Conferences on Theory and Practice of Software (ETAPS 2021).

FASE is concerned with the foundations on which software engineering is built, including topics like software engineering as an engineering discipline, requirements engineering, software architectures, software quality, model-driven development, software processes, software evolution, search-based software engineering, and the specification, design, and implementation of particular classes of systems, such as (self-)adaptive, collaborative, intelligent, embedded, distributed, mobile, pervasive, cyber-physical, or service-oriented applications.

FASE 2021 received 51 submissions. The submissions came from the following countries (in alphabetical order): Argentina, Australia, Austria, Belgium, Brazil, Canada, China, France, Germany, Iceland, India, Ireland, Italy, Luxembourg, Macedonia, Malta, Netherlands, Norway, Russia, Singapore, South Korea, Spain, Sweden, Taiwan, United Kingdom, and United States. FASE used a double-blind reviewing process. Each submission was reviewed by three Program Committee members. After an online discussion period, the Program Committee accepted 16 papers as part of the conference program (31% acceptance rate).

FASE 2021 hosted the 3rd International Competition on Software Testing (Test-Comp 2021). Test-Comp is an annual comparative evaluation of testing tools. This edition contained 11 participating tools, from academia and industry. These proceedings contain the competition report and three system descriptions of participating tools. The system-description papers were reviewed and selected by a separate program committee: the Test-Comp jury. Each paper was assessed by at least three reviewers. Two sessions in the FASE program were reserved for the presentation of the results: the summary by the Test-Comp chair and the participating tools by the developer teams in the first session, and the community meeting in the second session.

A lot of people contributed to the success of FASE 2021. We are grateful to the Program Committee members and reviewers for their thorough reviews and constructive discussions. We thank the ETAPS 2021 organizers, in particular, Peter Y. A. Ryan (General Chair), Joaquin Garcia-Alfaro (Workshops Chair), Peter Roenne (Organization Chair), Magali Martin (Event Manager), David Mestel (Publicity Chair) and Alfredo Rial (Local Proceedings Chair). We also thank Marieke Huisman (Steering Committee Chair of ETAPS 2021) for managing the process, and Gabriele Taentzer (Steering Committee Chair of FASE 2021) for her feedback and support. Last but not least, we would like to thank the authors for their excellent work.

March 2021

Esther Guerra, Mariëlle Stoelinga

# Organization

# Steering Committee


# FASE – Program Committee

| Member | Affiliation |
|---|---|
| João Paulo Almeida | Universidade Federal do Espírito Santo, Brazil |
| Étienne André | LORIA, Université de Lorraine, France |
| Uwe Aßmann | Technische Universität Dresden, Germany |
| Artur Boronat | University of Leicester, UK |
| Paolo Bottoni | Sapienza University of Rome, Italy |
| Jordi Cabot | ICREA - Universitat Oberta de Catalunya, Spain |
| Yu-Fang Chen | Academia Sinica, Taiwan |
| Philippe Collet | Université Côte d'Azur - CNRS/I3S, France |
| Francisco Durán | University of Málaga, Spain |
| Marie-Christine Jakobs | Technische Universität Darmstadt, Germany |
| Nils Jansen | Radboud University Nijmegen, The Netherlands |
| Einar Broch Johnsen | University of Oslo, Norway |
| Leen Lambers | Hasso-Plattner-Institut, Universität Potsdam, Germany |
| Yi Li | Nanyang Technological University, Singapore |
| Stefan Mitsch | Carnegie Mellon University, USA |
| Martin R. Neuhäußer | Siemens AG, Germany |
| Ajitha Rajan | University of Edinburgh, UK |
| Augusto Sampaio | Federal University of Pernambuco, Brazil |
| Perdita Stevens | University of Edinburgh, UK |
| Daniel Strüber | Radboud University Nijmegen, The Netherlands |
| Gabriele Taentzer | Philipps-Universität Marburg, Germany |


# Test-Comp – Program Committee and Jury


# Additional Reviewers

Antonino, Pedro; Babikian, Aren; Badings, Thom; Bubel, Richard; Búr, Márton; Cánovas Izquierdo, Javier Luis; Chang, Yun-Sheng; Clarisó, Robert; De Lara, Juan; Din, Crystal Chang; Du, Xiaoning; Gómez, Abel; Hajdu, Ákos; Haltermann, Jan; Kamburjan, Eduard; König, Jürgen; Lehner, Daniel; Lienhardt, Michael; Lin, Tzu Chi; Martens, Jan; Mey, Johannes; Morgenstern, Martin; Mukelabai, Mukelabai; Oliveira, Marcel Vinícius Medeiros; Oruc, Orcun; Osama, Muhammad; Pun, Violet Ka I.; Richter, Cedric; Sharma, Arnab; Steffen, Martin; Stolz, Volker; Suilen, Marnix; Szárnyas, Gábor; Tang, Yun; Tsai, Wei-Lun; Veeraragavan, Narasimha Raghavan; Waga, Masaki; Weinreich, Rainer; Wu, Xiuheng; Zhu, Chenguang




# **FASE Contributions**

#### On Benchmarking for Concurrent Runtime Verification<sup>⋆</sup>

Luca Aceto<sup>2,3</sup>, Duncan Paul Attard<sup>1,2</sup>, Adrian Francalanza<sup>1</sup>, and Anna Ingólfsdóttir<sup>2</sup>

<sup>1</sup> University of Malta, Msida, Malta {duncan.attard.01,afra1}@um.edu.mt

<sup>2</sup> Reykjavík University, Reykjavík, Iceland {luca,duncanpa17,annai}@ru.is

<sup>3</sup> Gran Sasso Science Institute, L'Aquila, Italy {luca.aceto}@gssi.it

Abstract. We present a synthetic benchmarking framework that targets the systematic evaluation of RV tools for message-based concurrent systems. Our tool can emulate various load profiles via configuration. It provides a multi-faceted view of measurements that is conducive to a comprehensive assessment of the overhead induced by runtime monitoring. The tool is able to generate significant loads to reveal edge case behaviour that may only emerge when the monitoring system is pushed to its limit. We evaluate our framework in two ways. First, we conduct sanity checks to assess the precision of the measurement mechanisms used, the repeatability of the results obtained, and the veracity of the behaviour emulated by our synthetic benchmark. We then showcase the utility of the features offered by our tool in a two-part RV case study.

Keywords: Runtime verification · Synthetic benchmarking · Software performance evaluation · Concurrent systems

# 1 Introduction

Large-scale software design has shifted from the classic monolithic architecture to one where applications are structured in terms of independently-executing asynchronous components [17]. This shift poses new challenges to the validation of such systems. Runtime Verification (RV) [9,27] is a *post-deployment* technique that is used to complement other methods such as testing [46] to assess the *functional* (*e.g.* correctness) and *non-functional* (*e.g.* quality of service) aspects of concurrent software. RV relies on instrumenting the system to be analysed with monitors, which inevitably introduce *runtime overhead* that should be kept minimal [9]. While the worst-case complexity bounds for monitor-induced overheads can be calculated via standard methods (see, *e.g.* [40,14,1,28]), *benchmarking* is, by far, the preferred method for assessing these overheads [9,27]. One reason for this choice is that benchmarks tend to be more *representative* of the overhead observed in practice [30,15]. Benchmarks also provide a *common platform* for gauging workloads, making it possible to *compare* different RV tool implementations, or rerun experiments to *reproduce* and *confirm* existing results.

<sup>⋆</sup> Supported by the doctoral student grant (No: 207055-051) and the TheoFoMon project (No: 163406-051) under the Icelandic Research Fund, the BehAPI project funded by the EU H2020 RISE under the Marie Skłodowska-Curie action (No: 778233), the ENDEAVOUR Scholarship Scheme (Group B, national funds), and the MIUR project PRIN 2017FTXR7S IT MATTERS.

© The Author(s) 2021

E. Guerra and M. Stoelinga (Eds.): FASE 2021, LNCS 12649, pp. 3–23, 2021. https://doi.org/10.1007/978-3-030-71500-7\_1

The utility of a benchmarking tool typically rests on two aspects: *(i)* the *coverage* of scenarios of interest, and *(ii)* the quality of *runtime metrics* collected by the benchmark harness. To represent scenarios of interest, benchmarking tools generally employ suites of third-party *off-the-shelf (OTS) programs* (*e.g.* [60,11,59]). OTS software is appealing because it is readily usable and inherently provides realistic scenarios. By and large, benchmarks rely on a range of OTS programs to broaden the coverage of real-world scenarios (*e.g.* DaCapo [11] uses 11 open-source libraries). Yet, using OTS programs as benchmarks poses challenges. By design, these programs do *not* expose hooks that enable harnesses to easily and accurately gather the runtime metrics of interest. When OTS software is treated as a black box, benchmarks become harder to control, impacting their ability to produce repeatable results. OTS software-based benchmarks are also limited when inducing specific edge cases—this aspect is critical when assessing the safety of software, such as runtime monitors, that are often assumed to be *dependable*. Custom-built *synthetic programs* (*e.g.* [35]) are an alternative way to perform benchmarking. These tend to be less popular due to the perceived drawbacks associated with developing such programs from scratch, and the lack of 'real-world' behaviour intrinsic to benchmarks based on OTS software. However, synthetic benchmarks offer benefits that offset these drawbacks. For example, *specialised* hooks can be built into the synthetic set-up to collect a broad range of runtime metrics. Moreover, synthetic benchmarks can also be *parametrised* to emulate variations on the same core benchmark behaviour; this is usually harder to achieve via OTS programs that implement narrow use cases.

Established benchmarking tools such as SPECjvm2008 [60], DaCapo [11], ScalaBench [59] and Savina [35]—developed for the JVM—feature extensively in the RV literature, *e.g.* see [48,19,18,54,13,45]. Apart from [45], these works assess the runtime overhead solely in terms of the *execution slowdown*, *i.e.,* the difference in running time between the system fitted with and without monitors. Recently, the International RV competition (CRV) [8] advocated for other metrics, such as *memory consumption*, to give a more qualitative view of runtime overhead. We hold that RV set-ups that target concurrency benefit from other facets of runtime behaviour, such as the *response time*, that captures the overhead between communicating components. Tangibly, this metric reflects the *perceived reactiveness* from an end-user standpoint (*e.g.* interactive apps) [50,61,58,21]; more generally, it describes the *service degradation* that must be accounted for to ensure adequate quality of service [15,39]. Arguably, benchmarking tools like the ones above (*e.g.* Savina) should provide even more. Often, RV set-ups for concurrent systems *need* to scale in response to dynamic changes, and the capacity for a benchmark to emulate *high loads* cannot be overstated. In actual fact, these loads are known to assume characteristic *profiles* (*e.g.* spikes or uniform rates), which are hard to administer with the benchmarks mentioned earlier.

The state of the art in benchmarking for concurrent RV suffers from another issue. Existing benchmarks—conceived for validating other tools—are repurposed for RV and often *fail* to cater for concurrent scenarios where RV is realistically put to use. SPECjvm2008, DaCapo, and ScalaBench lack workloads that leverage the JVM concurrency primitives [52]; meanwhile, [12] shows that the Savina microbenchmarks are essentially sequential, and that the rest of the programs in the suite are sufficiently simple to be regarded as microbenchmarks too. The CRV suite mostly targets *monolithic* software with limited concurrency, where the potential for scaling up to high loads is, therefore, severely curbed.

This paper presents a benchmarking framework for evaluating *runtime monitoring* tools written for verification purposes. Our tool focusses on component systems for asynchronous message-passing concurrency. It generates synthetic system models following the *master-slave* architecture [61]. The master-slave architecture is pervasive in distributed (*e.g.* DNS, IoT) and concurrent (*e.g.* web servers, thread pools) systems [61,29], and lies at the core of the MapReduce model [22] supported by Big Data frameworks such as Hadoop [63]. This justifies our aim to build a benchmarking tool targeting this architecture. Concretely:


# 2 Benchmark Design and Implementation

Our set-up can emulate a range of system models and subject them to various load types. We consider master-slave architectures, where one central process, called the *master*, creates and allocates tasks to *slave* processes [61]. Slaves work concurrently on tasks, relaying the result to the master when ready; the latter then combines these results to yield the final output. Our slaves are an *abstraction* of sets of cooperating processes that can be treated as a single unit.

#### 2.1 Approach

We target concurrent applications that execute on a single node. Nevertheless, our design adheres to three criteria that facilitate its extension to a distributed setting. Specifically, components: *(i)* share neither a common clock, *(ii)* nor memory, and *(iii)* communicate via asynchronous messages. Our present set-up assumes that communication is reliable and components do not fail.

*Load generation.* Load on the system is induced by the master when it creates slave processes and allocates *tasks*. The total number of slaves in one run can be set via the parameter n. Tasks are allocated to slave processes by the master, and consist of one or more *work requests* that a slave receives, handles, and relays back. A slave terminates its execution when all of its allocated work requests have been processed *and* acknowledged by the master. The number of work requests that *can* be batched in a task is controlled by the parameter w; the *actual* batch size per slave is then drawn randomly from a normal distribution with mean μ=w and standard deviation σ=μ×0.02. This induces a degree of variability in the number of work requests exchanged between master and slaves. The master and slaves communicate *asynchronously*: an allocated work request is delivered to a slave process's incoming work queue where it is eventually handled. Work responses issued by a slave are queued and processed similarly on the master.
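The batch-size draw described above can be illustrated with a minimal Python sketch. It is not the authors' Erlang implementation: the function name `draw_batch_sizes`, the rounding to whole requests, and the lower bound of one request per slave are assumptions made for illustration only.

```python
import random


def draw_batch_sizes(n, w, seed=None):
    """Draw one work-request batch size per slave from N(mu=w, sigma=0.02*w)."""
    rng = random.Random(seed)  # a fixed seed gives reproducible sequences
    mu = float(w)
    sigma = mu * 0.02
    # Assumption: sizes are rounded to integers and never drop below 1.
    return [max(1, round(rng.gauss(mu, sigma))) for _ in range(n)]


if __name__ == "__main__":
    # Five slaves, each batching roughly 100 work requests.
    print(draw_batch_sizes(n=5, w=100, seed=42))
```

With σ at 2% of the mean, the drawn sizes stay close to w, which matches the intent of adding only a small degree of variability.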

*Load configuration.* We consider *three load profiles* (see fig. 3 for examples) that determine how the creation of slaves is distributed along the load timeline t. The timeline is modelled as a sequence of *discrete logical time units* representing instants at which a new set of slaves is created by the master. *Steady* loads replicate executions where a system operates under stable conditions. These are modelled on a homogeneous Poisson distribution with *rate* λ, specifying the mean number of slaves that are created at each time instant along the load timeline with duration t=n/λ. *Pulse* loads emulate settings where a system experiences gradually increasing load peaks. The Pulse load shape is parametrised by t and the *spread*, s, that controls how slowly or sharply the system load increases as it approaches its maximum peak, halfway along t. Pulses are modelled on a normal distribution with μ=t/2 and σ=s. *Burst* loads capture scenarios where a system is stressed due to load spikes; these are based on a log-normal distribution with $\mu=\ln\bigl(m^2/\sqrt{p^2+m^2}\bigr)$ and $\sigma=\sqrt{\ln(1+p^2/m^2)}$, where m=t/2, and parameter p is the *pinch* controlling the concentration of the initial load burst.
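As a rough illustration of how such profiles might be materialised, the Python sketch below samples n slave-creation instants and bins them into t discrete time units for the Pulse and Burst shapes. It is a sketch under stated assumptions rather than the paper's implementation: the function names are hypothetical, out-of-range samples are clamped to the timeline, and the Burst parameters use the standard moment-matching form of the log-normal distribution (with square roots), which is assumed to be what the text intends. The Steady profile is omitted for brevity.

```python
import math
import random


def _bin_instants(instants, t):
    """Count how many creation instants fall into each of the t time units."""
    profile = [0] * t
    for x in instants:
        # Assumption: samples outside the timeline are clamped to its ends.
        idx = min(t - 1, max(0, int(x)))
        profile[idx] += 1
    return profile


def pulse_profile(n, t, s, seed=None):
    """Pulse load: instants drawn from a normal with mu=t/2 and sigma=s."""
    rng = random.Random(seed)
    return _bin_instants((rng.gauss(t / 2, s) for _ in range(n)), t)


def burst_profile(n, t, p, seed=None):
    """Burst load: instants drawn from a log-normal with m=t/2 and pinch p."""
    rng = random.Random(seed)
    m = t / 2
    mu = math.log(m * m / math.sqrt(p * p + m * m))
    sigma = math.sqrt(math.log(1 + (p * p) / (m * m)))
    return _bin_instants((rng.lognormvariate(mu, sigma) for _ in range(n)), t)


if __name__ == "__main__":
    print(pulse_profile(n=50, t=10, s=2, seed=1))
    print(burst_profile(n=50, t=10, p=5, seed=1))
```

Each returned list plays the role of the load profile array: element i holds the number of slaves to create at logical instant i.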

*Wall-clock time.* A load profile created for a logical timeline t is put into effect by the master process when the system starts running. The master *does not* create the slave processes that are set to execute in a particular time unit *in one go*, since this naïve strategy risks saturating the system, deceivingly increasing the load. In doing so, the system may become overloaded not because the mean request rate is high, but because the created slaves overwhelm the master when they send their requests all at once. We address this issue by introducing the notion of *concrete time* that maps one discrete time unit in t to a real time *period*, π. The parameter π is given in milliseconds (ms), and defaults to 1000 ms.

*Slave scheduling.* The master process employs a scheduling scheme to distribute the creation of slaves uniformly across the time period π. It makes use of three queues: the *Order* queue, *Ready* queue, and *Await* queue, denoted by Q<sub>O</sub>, Q<sub>R</sub>, and Q<sub>A</sub> respectively. Q<sub>O</sub> is initially populated with the load profile, step 1 in fig. 1a. The load profile consists of an array with t elements, each corresponding to a discrete time instant in t, where the value l of every element indicates the number of slaves to be created at that instant. Slaves, S<sub>1</sub>, S<sub>2</sub>, ..., S<sub>n</sub>, are scheduled and created in *rounds*, as follows. The master picks the first element from Q<sub>O</sub> to compute the upcoming schedule, step 2, that starts at the *current* time, c, and finishes at c + π. A series of l time points, p<sub>1</sub>, p<sub>2</sub>, ..., p<sub>l</sub>, in the schedule period π are *cumulatively* calculated by drawing the next p<sub>i</sub> from a normal distribution with μ=π/l and σ=μ×0.1. Each time point stipulates a moment in *wall-clock* time when a new slave S<sub>j</sub> is to be created; this set of time points is *monotonic*, and constitutes the Ready queue, Q<sub>R</sub>, step 3. The master checks Q<sub>R</sub>, step 4 in fig. 1b, and creates the slaves whose time point p<sub>i</sub> is smaller than or equal to the current wall-clock time<sup>4</sup>, steps 5 and 6 in fig. 1b. The time point p<sub>i</sub> of a newly-created slave is removed from Q<sub>R</sub>, and an entry for the corresponding slave S<sub>j</sub> is appended to the Await queue Q<sub>A</sub>; this is shown in step 7 for S<sub>1</sub> and S<sub>2</sub>. Slaves in Q<sub>A</sub> are now ready to receive work requests from the master process, *e.g.* step 8. Q<sub>A</sub> is traversed by the master at this stage so that work requests can be allocated to existing slaves. The master continues processing queue Q<sub>R</sub> in subsequent rounds, creating slaves, issuing work requests, and updating Q<sub>R</sub> and Q<sub>A</sub> accordingly as shown in steps 9–13 in fig. 1c. At any point, the master can receive responses, *e.g.* step 17 in fig. 1d; these are *buffered* inside the master's incoming work queue and handled once the scheduling and work allocation phases are complete. A *fresh* batch of slaves from Q<sub>O</sub> is scheduled by the master whenever Q<sub>R</sub> becomes empty, step 15, and the described procedure is repeated. The master stops scheduling slaves when all the entries in Q<sub>O</sub> are processed. It then transitions to *work-only* mode, where it continues allocating work requests and handling incoming responses from slaves.

Fig. 1: Master M scheduling slave processes S<sub>j</sub> and allocating work requests. Legend: selected for processing, slave created, slave terminated.

(a) Master schedules the first batch of four slaves for execution in Q<sub>R</sub>

(b) Slaves S<sub>1</sub> and S<sub>2</sub> created and added to Q<sub>A</sub>; a work request is sent to S<sub>1</sub>

(c) Slaves S<sub>3</sub> and S<sub>4</sub> created and added to Q<sub>A</sub>; slave S<sub>2</sub> completes its execution

(d) Q<sub>R</sub> becomes empty; master schedules the next batch of two slaves

<sup>4</sup> We assume that the platform scheduling the master and slave processes is fair.
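The cumulative computation of time points for one scheduling round can be sketched in Python as follows. This is an illustrative reading of the scheme, not the authors' code: the names `schedule_round` and `split_due` are hypothetical, times are plain floats in milliseconds, and negative increments are clamped to zero so that the resulting Ready queue stays monotonic.

```python
import random


def schedule_round(l, c, period_ms=1000.0, seed=None):
    """Compute l wall-clock time points in [c, c + period] for one round."""
    if l == 0:
        return []
    rng = random.Random(seed)
    mu = period_ms / l
    sigma = mu * 0.1
    points, at = [], c
    for _ in range(l):
        # Each point is the previous one plus a normally distributed gap.
        # Assumption: gaps are clamped at zero to keep the queue monotonic.
        at += max(0.0, rng.gauss(mu, sigma))
        points.append(at)
    return points


def split_due(ready_queue, now):
    """Separate time points that are due (<= now) from those still pending."""
    due = [p for p in ready_queue if p <= now]
    pending = [p for p in ready_queue if p > now]
    return due, pending


if __name__ == "__main__":
    ready = schedule_round(l=4, c=0.0, seed=3)
    due, ready = split_due(ready, now=600.0)
    print("create now:", due, "still waiting:", ready)
```

Spreading the l creations across the period, rather than issuing them at once, is what keeps the master from being flooded by simultaneous requests at the start of each time unit.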

*Reactiveness and task allocation.* Systems generally respond to load with differing rates, due to the computational complexity of the task at hand, IO, or slowdown when the system itself becomes gradually loaded. We simulate these phenomena using the parameters Pr(send) and Pr(recv). The master *interleaves* the processing of work requests to allocate them uniformly among the various slaves: Pr(send) and Pr(recv) bias this behaviour. Specifically, Pr(send) controls the probability that a work request is sent by the master to a slave, whereas Pr(recv) determines the probability that a work response received by the master is processed. Sending and receiving are *turn-based* and modelled on a Bernoulli trial. The master picks a slave S<sub>j</sub> from Q<sub>A</sub> and sends *at least* one work request when X ≤ Pr(send), *i.e.,* the Bernoulli trial succeeds; X is drawn from a uniform distribution on the interval [0,1]. Further requests to the *same* slave are allocated following this scheme (steps 8, 13 and 20 in fig. 1) and the entry for S<sub>j</sub> in Q<sub>A</sub> is updated accordingly with the number of work requests remaining. When X > Pr(send), *i.e.,* the Bernoulli trial fails, the slave misses its turn, and the next slave in Q<sub>A</sub> is picked. The master also queries its incoming work queue to determine whether a response can be processed. It dequeues one response when X ≤ Pr(recv), and the attempt is repeated for the next response in the queue until X > Pr(recv). The master signals slaves to terminate once it acknowledges all of their work responses (*e.g.* step 14). Due to the load imbalance that may occur when the master becomes overloaded with work responses relayed by slaves, dequeuing is repeated |Q<sub>A</sub>| times. This encourages an even load distribution in the system as the number of slaves *fluctuates* at runtime.
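One possible reading of the turn-based send and receive logic is sketched below in Python. The data layout (a list of dictionaries for the Await queue, a `deque` for the master's mailbox) and the function names are assumptions made for illustration; the actual tool is written in Erlang and may structure this differently.

```python
import random
from collections import deque


def allocate_round(await_queue, pr_send, rng):
    """Traverse the Await queue once, sending requests per Bernoulli trial."""
    sent = []
    for entry in await_queue:
        # Keep sending to the same slave while the trial X <= Pr(send)
        # succeeds; on the first failure the slave misses its turn.
        while entry["remaining"] > 0 and rng.random() <= pr_send:
            entry["remaining"] -= 1
            sent.append(entry["id"])
    return sent


def dequeue_responses(mailbox, pr_recv, rounds, rng):
    """Process buffered responses; the attempt is repeated `rounds` times."""
    handled = []
    for _ in range(rounds):  # the text repeats dequeuing |Q_A| times
        while mailbox and rng.random() <= pr_recv:
            handled.append(mailbox.popleft())
    return handled


if __name__ == "__main__":
    rng = random.Random(7)
    q_a = [{"id": 1, "remaining": 3}, {"id": 2, "remaining": 3}]
    print("sent to:", allocate_round(q_a, pr_send=0.9, rng=rng))
    inbox = deque(["resp-1", "resp-2", "resp-3"])
    print("handled:", dequeue_responses(inbox, 0.9, rounds=len(q_a), rng=rng))
```

Lowering Pr(send) or Pr(recv) in this sketch makes trials fail more often, so fewer requests are issued or fewer responses are drained per round, which is how the two parameters emulate a less reactive system.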

#### 2.2 Realisability

The set-up detailed in sec. 2.1 is easily translatable to the actor model of computation [2]. In this model, the basic units of decomposition are *actors*: concurrent entities that do not share mutable memory with other actors. Instead, they interact via *asynchronous messaging*. Each actor owns an incoming message buffer called the *mailbox*. Besides sending and receiving messages, an actor can also *fork* other child actors. Actors are uniquely addressable via a dynamically-assigned *identifier*, often referred to as the PID. Actor frameworks such as Erlang [16], Akka [55] for Scala [51], and Thespian [53] for Python [44] implement actors as *lightweight* processes to enable highly-scalable architectures that span multiple machines. The terms *actor* and *process* are used interchangeably henceforth.

*Implementation.* We use Erlang to implement the set-up of sec. 2.1. Our implementation maps the master and slave processes to actors, where slaves are forked by the master via the Erlang function spawn(); in Akka and Thespian ActorContext.spawn() and Actor.createActor() can be respectively used to the same effect. The work request queues for both master and slave processes coincide with actor mailboxes. We abstract the task computation and model work requests as Erlang messages. Slaves emulate no delay, but respond instantly to work requests once these have been processed; delay in the system can be induced via parameters Pr(send) and Pr(recv). To maximise efficiency, the Order, Ready and Await queues used by our scheduling scheme are maintained *locally* within the master. The master process keeps track of other details, such as the total number of work requests sent and received, to determine when the system should stop executing. We extend the parameters in sec. 2.1 with a *seed* parameter, r, to fix the Erlang pseudorandom number generator to output reproducible number sequences.

#### 2.3 Measurement Collection

To give a multi-faceted view of runtime overhead, we extend the approach in [8] and, apart from the *(i)* mean *execution duration*, measured in seconds (s), we also collect the *(ii)* mean *scheduler utilisation*, as a percentage of the total available capacity, *(iii)* mean *memory consumption*, measured in GB, and, *(iv)* mean *response time (RT)*, measured in milliseconds (ms). Our definition of runtime overhead encompasses all four metrics. Measurement taking largely depends on the platform on which the benchmark executes, and one often leverages *platform-specific* optimised functionality in order to attain high levels of efficiency. Our implementation relies on the functionality provided by the Erlang ecosystem.

*Sampling.* We collect measurements centrally using a special process, called the *Collector*, that samples the runtime to obtain periodic snapshots of the execution environment (see fig. 2). Sampling is often necessary to keep the overhead induced in the system low, especially in scenarios where the system components are sensitive to latency [32]. Our sampling interval is set to 500 ms: this figure was determined empirically, such that the measurements gathered are neither too coarse, nor so fine-grained that sampling affects the runtime. Every sampling snapshot combines the four metrics mentioned above and formats them as records that are written *asynchronously* to disk to minimise IO delays.

*Performance metrics.* Memory and scheduler readings are gathered via the Erlang Virtual Machine (EVM). We sample scheduler utilisation rather than CPU utilisation at the OS level, since the EVM keeps scheduler threads momentarily spinning to remain reactive, which would inflate an OS-level CPU reading. The overall system responsiveness is captured by the mean RT metric. Our Collector exposes a hook that the master uses to obtain *unique timestamps* (step 1 in fig. 2). These are embedded in all work request messages the master issues to slaves. Each timestamp enables the Collector to track the time taken for a message to travel from the master to a slave and back, *including* the time it spends in the master's mailbox until dequeued, *i.e.,* the round-trip in steps 2–5. To efficiently compute the RT, the Collector samples the total number of messages exchanged between the master and slaves, and calculates the mean using Welford's online algorithm [62].
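Welford's online algorithm updates the mean (and, if needed, the variance) one observation at a time, so that individual round-trip times need not be stored. The following Python sketch shows the standard update rule; the class name `RunningMean` and its methods are illustrative and do not mirror the Collector's Erlang code.

```python
class RunningMean:
    """Welford's online update for the mean and sample variance."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, value):
        """Fold one new observation (e.g. a round-trip time in ms) in."""
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        # Uses the already updated mean, as prescribed by Welford's method.
        self.m2 += delta * (value - self.mean)

    def variance(self):
        """Sample variance; defined as 0.0 for fewer than two values."""
        return self.m2 / (self.count - 1) if self.count > 1 else 0.0
```

Each update takes constant time and memory, which is why the approach suits a collector that must not perturb the system it measures.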

Fig. 2: Collector tracking the round-trip time for work requests and responses

# 3 Evaluation

We evaluate the synthetic benchmarking tool described in sec. 2 in a number of ways. In sec. 3.1, we discuss sanity checks for its measurement collection mechanisms, and assess the repeatability of the results obtained from the synthetic system executions. Crucially, sec. 3.1 provides evidence that the benchmarking tool is sufficiently expressive to cover a number of execution profiles that are shown to emulate realistic scenarios. Sec. 3.2 demonstrates the utility of the features offered by our tool for the purposes of assessing RV tools.

*Experiment set-up.* We define an *experiment* to consist of ten benchmarks, each performed by running the system set-up with incremental loads. Our experiments were performed on an Intel Core i7 M620 64-bit machine with 8GB of memory, running Ubuntu 18.04 LTS and Erlang/OTP 22.2.1.

#### 3.1 Benchmark Expressiveness and Veracity

The parameters for the tool detailed in sec. 2.1 can be configured to model a range of master-slave scenarios. However, not all of these configurations are meaningful in practice. For example, setting Pr(send)=0 does not enable the master to allocate work requests to slaves; with Pr(send)=1, this allocation is enacted sequentially, defeating the purpose of a concurrent master-slave system. In this section, we establish a set of parameter values that model experiment setups whose behaviour *approximates* that of master-slave systems typically found in practice. Our experiments are conducted with n=500k slaves and w=100 work requests per slave. This generates ≈n×w×(work requests and responses)=100M message exchanges between the master and slaves. We initially fix Pr(send) = Pr(recv)=0.9, and choose a Steady (*i.e.,* Poisson process) load profile since this features in industry-strength load testing tools such as Tsung [49] and JMeter [3]. Fig. 3 shows the load applied at each benchmark run, *e.g.* on the tenth run, the benchmark uses ≈ 5k slaves/s. The total loading time is set to t = 100s.
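The figures quoted above can be reproduced with a short Python sketch. It assumes that a Steady load is generated by drawing exponential inter-arrival times at rate n/t, which is one standard way of simulating a Poisson process; the function names are hypothetical, and the total number of arrivals produced this way is only approximately n, whereas the tool itself works with a fixed n.

```python
import random


def message_exchanges(num_slaves, requests_per_slave):
    """Each work request is paired with one response: n * w * 2 messages."""
    return num_slaves * requests_per_slave * 2


def steady_arrivals(total_slaves, loading_time, seed=0):
    """Simulate slave creation as a Poisson process over loading_time seconds.

    Returns a list with the number of slaves created in each second.
    """
    rng = random.Random(seed)
    rate = total_slaves / loading_time  # mean number of slaves per second
    per_second = [0] * loading_time
    clock = 0.0
    while True:
        # Inter-arrival times of a Poisson process are exponential.
        clock += rng.expovariate(rate)
        if clock >= loading_time:
            break
        per_second[int(clock)] += 1
    return per_second
```

For n=500k and w=100, `message_exchanges` yields the 100M exchanges stated above, and the mean arrival rate over t=100s is 5k slaves/s.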

*Measurement precision.* A series of trials were conducted to select the appropriate sampling window size for the RT. This step is crucial because it directly affects the capability of the benchmark to scale in terms of its number of slave processes and work requests. Our RT sampling of sec. 2.3 (see also fig. 2) was calibrated by taking various window sizes over numerous runs for different load profiles of ≈ 1M slaves. The results were compared to the *actual* mean calculated on *all* work request and response messages exchanged between master and slaves. Window sizes close to 10 % yielded the best results (≈ ±1.4% discrepancy from the actual RT). Smaller window sizes produced excessive discrepancy; larger sizes induced noticeably higher system loads. We also cross-checked the precision of our sampling method of the scheduler utilisation against readings obtained via the Erlang Observer tool [16] to confirm that these coincide.

*Experiment repeatability.* Data variability affects the *repeatability* of experiments. It also plays a role when determining the number of repeated readings, k, required before the data measured is deemed *sufficiently representative*. Choosing the lowest k is crucial when experiment runs are time consuming. The *coefficient of variation* (CV), *i.e.,* the ratio of the standard deviation to the mean, $\mathrm{CV} = \frac{\sigma}{\bar{x}} \times 100$, can be used to establish the value of k empirically, as follows. Initially, the CV<sub>k</sub> for one batch of experiments for some number of repetitions k is calculated. The result is then compared to the CV<sub>k′</sub> for the next batch of repetitions k′ = k + b, where b is the step size. When the difference between the successive metrics CV<sub>k</sub> and CV<sub>k′</sub> is sufficiently small (within some chosen percentage), the value of k is chosen; otherwise the described procedure is repeated with k′. Crucially, this condition must hold for *all variables* measured in the experiment before k can be fixed. For the results presented next, the CV values were calculated manually. The mechanism that determines the CV automatically is left for future work.
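The procedure for choosing k can be sketched in Python as follows. This is an illustration under stated assumptions rather than part of the tool: `run_batch` is a hypothetical callback returning the readings of every metric for a given number of repetitions, the sample standard deviation is used for σ, and `max_k` is an added safeguard that bounds the search.

```python
import statistics


def coefficient_of_variation(samples):
    """CV as a percentage: sample standard deviation over mean, times 100."""
    return statistics.stdev(samples) / statistics.mean(samples) * 100


def choose_repetitions(run_batch, k, step, threshold, max_k=30):
    """Return the smallest k whose CV differs little from that of k + step.

    run_batch(k) must return a dict mapping each metric name to a list of
    k readings (k >= 2). The condition has to hold for all metrics.
    Returns None when no suitable k is found up to max_k.
    """
    previous = {m: coefficient_of_variation(v) for m, v in run_batch(k).items()}
    while k + step <= max_k:
        batch = run_batch(k + step)
        current = {m: coefficient_of_variation(v) for m, v in batch.items()}
        if all(abs(current[m] - previous[m]) <= threshold for m in current):
            return k
        k, previous = k + step, current
    return None
```

Calling `choose_repetitions(run_batch, k=3, step=3, threshold=0.1)` compares batches of three and six repetitions first, mirroring the batch sizes used in the experiments that follow.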

*Data variability.* The data variability between experiments can be reduced by seeding the Erlang pseudorandom number generator (parameter r in sec. 2.2) with a constant value. This, in turn, tends to require fewer repeated runs before the metrics of interest—scheduler utilisation, memory consumption, RT, and execution duration—converge to an acceptable CV. We conduct experiment sets with three, six and nine repetitions. For the majority of cases, the CV for our metrics is *lower* when a fixed seed is used, by comparison to its unseeded counterpart. In fact, very low CV values for the scheduler utilisation, memory consumption, RT, and execution duration, 0.17 %, 0.15 %, 0.52 % and 0.47 % respectively, were obtained with three repeated runs. We thus set the number of repetitions to *three* for *all* experiment runs in the sequel. Note that fixing the seed *still* permits the system to exhibit a modicum of variability that stems from the inherent *interleaved execution* of components due to process scheduling.

*Load profiles.* Our tool is expressive enough to generate the load profiles introduced in sec. 2.1 (see fig. 3), enabling us to gauge the behaviour of monitoring set-ups under varying forms of loads. These loads make it possible to mock specific system scenarios that test different implementation aspects. For example, a benchmark configured with load surges could uncover buffer overflows in a particular monitoring implementation that only arise under stress when the length of the request queue exceeds some preset length.

*System reactivity.* The reactivity of the master-slave system correlates with the idle time of each slave which, in turn, affects the capacity of the system to *absorb* overheads. Since this can skew the results obtained when assessing overheads, it is imperative that the benchmarking tool provides methods to control this aspect. The parameters Pr(send) and Pr(recv) regulate the speed with which the system reacts to load. We study how these parameters affect the overall performance of system models set up with Pr(send) = Pr(recv)∈ {0.1,0.5,0.9}. The results are shown in fig. 4, where each metric (*e.g.* memory consumption) is plotted against the total number of slaves. At Pr(send)=Pr(recv)=0.1, the system has the lowest RT out of the three configurations (bottom left), as indicated by the gentle linear increase of the plot. One may expect the RT to be *lower* for the system models configured with probability values of 0.5 and 0.9. However, we recall that with Pr(send)=0.1, work requests are allocated infrequently by the master, so that slaves are *often idle*, and can *readily* respond to (low numbers of) incoming work requests. At the same time, this prolongs the execution duration, when compared to that of the system set with Pr(send) = Pr(recv)∈ {0.5,0.9} (bottom right). This effect of slave idling can be gleaned from the relatively lower scheduler utilisation as well (top left). Idling increases memory consumption (top right), since slaves created by the master typically remain alive for extended periods. By contrast, the plots set with Pr(send)=Pr(recv)∈{0.5,0.9} exhibit markedly gentler gradients in the memory consumption and execution duration charts; corresponding linear slopes can be observed in the RT chart. This indicates that values between 0.5 and 0.9 yield system models that: *(i)* consume reasonable amounts of memory, *(ii)* execute in respectable amounts of time, and *(iii)* maintain tolerable RT. Since master-slave architectures are typically employed in settings where high throughput is demanded, choosing values smaller than 0.5 goes against this principle. In what follows, we opt for Pr(send)=Pr(recv)=0.9.

Fig. 3: Steady, Pulse and Burst load distributions of 500 k slaves for 100 s

*Emulation veracity.* Our benchmarks can be configured to closely model *realistic* web server traffic where the request intervals observed at the server are known to follow a Poisson process [31,43,37]. The probability distribution of the RT of web application requests is generally right-skewed, and approximates log-normal [31,20] or Erlang distributions [37]. We conduct three experiments using *Steady loads* fixed with n=10k for Pr(send)=Pr(recv) ∈ {0.1,0.5,0.9} to establish whether the RT in our system set-ups resembles the aforementioned distributions. Our results, summarised in fig. 5, were obtained by estimating the parameters for a set of candidate probability distributions (*e.g.* normal, log-normal, gamma, *etc.*) using maximum likelihood estimation [56] on the RT obtained from *each* experiment. We then performed goodness-of-fit tests on these parametrised distributions using the Kolmogorov-Smirnov test, selecting the most appropriate RT fit for each of the three experiments. The fitted distributions in fig. 5 indicate that the RT of our system models follows the findings reported in [31,20,37]. This makes a strong case in favour of our benchmarking tool striking a balance between the *realism* of benchmarks based on OTS programs and the *controllability* offered by synthetic benchmarking. Lastly, we point out that fig. 5 matches the observations made in fig. 4, which show an increase in the mean RT as the system becomes more reactive. This is evident in the histogram peaks that grow shorter as Pr(send)=Pr(recv) progresses from 0.1 to 0.9.

Fig. 4: Performance benchmarks of system models for Pr(send) and Pr(recv)
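The fitting step can be illustrated for a single candidate distribution. The Python sketch below estimates log-normal parameters by maximum likelihood and computes the Kolmogorov-Smirnov statistic against the fitted distribution using only the standard library. It is not the analysis script behind fig. 5: the function names are assumptions, the other candidate distributions are omitted, and no p-value is computed.

```python
import math


def fit_lognormal(samples):
    """Maximum likelihood estimates (mu, sigma) of a log-normal distribution.

    The estimates are the mean and the (biased) standard deviation of the
    logarithms of the samples, which must all be positive.
    """
    logs = [math.log(x) for x in samples]
    mu = sum(logs) / len(logs)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / len(logs))
    return mu, sigma


def lognormal_cdf(x, mu, sigma):
    """Cumulative distribution function of the log-normal distribution."""
    return 0.5 * (1 + math.erf((math.log(x) - mu) / (sigma * math.sqrt(2))))


def ks_statistic(samples, cdf):
    """Largest gap between the empirical CDF of the samples and cdf."""
    ordered = sorted(samples)
    n = len(ordered)
    return max(
        max((i + 1) / n - cdf(x), cdf(x) - i / n)
        for i, x in enumerate(ordered)
    )
```

Given a list `rts` of response times, `ks_statistic(rts, lambda x: lognormal_cdf(x, *fit_lognormal(rts)))` yields the statistic for the log-normal candidate; repeating this for every candidate and keeping the smallest statistic is one way of selecting the best fit.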

#### 3.2 Case Study

We demonstrate how our benchmarking tool can be used to assess the runtime overhead comprehensively via a concurrent RV case study. By controlling the benchmark parameters and subjecting the system to specific workloads, we show that our multi-faceted view of overhead reveals nuances in the observed runtime behaviour, benefitting the interpretation of empirical results. We further assess the veracity of these synthetic benchmarks against the overhead measured from a use case that considers industry-strength OTS applications.

Fig. 5: Fitted probability distributions on RT for Steady loads for n= 10k

**The RV Tool.** We use an RV tool to objectively compare the conclusions derived from our synthetic benchmarks against those obtained from the experiment set up with the OTS applications. The tool under scrutiny targets concurrent Erlang programs [4]. It synthesises *automata-like* monitors from sHML specifications [26] and *inlines* them into the system via *code injection* by manipulating the program abstract syntax tree. Inline instrumentation underlies various other state-of-the-art RV tools, such as JavaMOP [36], MarQ [54], Java-MaC [38] and RiTHM [47]. sHML is a fragment of the Hennessy-Milner Logic with recursion [41] that can express all regular safety properties [26]. The tool augments it to handle pattern matching and data dependencies for three kinds of event patterns, namely *send* and *receive* actions, denoted by ! and ? respectively, and process *crash*, denoted by . This suffices to specify properties of both the master and slave processes, resulting in the set-up depicted in fig. 6a. For instance, the recursive property ϕ<sub>s</sub> describes an *invariant* of the master-slave communication protocol (from the slave's point of view), stating that '*a slave processing integer successor requests should not crash*':

$$\varphi_s \;=\; \max X.\Big(\underbrace{[\mathit{Slv}\ \text{↯}]\,\mathsf{ff}}_{1} \;\wedge\; \underbrace{\overbrace{[\mathit{Slv}\,?\,\mathit{Req}]}^{2.1}\big(\underbrace{[\mathit{Slv}\ \text{↯}]\,\mathsf{ff}}_{3.1} \;\wedge\; \underbrace{[\mathit{Slv}\,!\,(\mathit{Req}+1)]\,X}_{3.2}\big)}_{2}\Big)$$

The key construct in sHML is the modal formula [p]ϕ, stating that *whenever* a satisfying system exhibits an event e matching pattern p, its continuation then satisfies ϕ. In property ϕ<sub>s</sub>, the invariant, denoted by the recursion binder max X, asserts that a slave *Slv* does not crash, specified by sub-formula 1. It further stipulates in sub-formula 2 that when a request-carrying payload *Req* is received (2.1), *Slv* cannot crash (3.1), *and* if the slave replies to *Req* with the payload *Req* + 1, the property *recurses* on variable X (3.2). Action patterns use two types of value variables: binders, \*x*, that are pattern-matched to concrete values learnt at runtime, and variable instances, *x*, that are bound by the respective binders and instantiated to concrete data via pattern matching at runtime. This induces the usual notion of free and bound value variables; we assume closed terms. For example, when checking property ϕ<sub>s</sub> against the trace event pid?42, the analysis unfolds the sub-formula guarded by max X, matching the event with the pattern \*Slv* ?\*Req* in 2.1. Variables *Slv* and *Req* are substituted with pid and 42 respectively in property ϕ<sub>s</sub>, leaving the residual formula:

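$$[\mathtt{pid}\ \text{↯}]\,\mathsf{ff} \;\wedge\; [\mathtt{pid}\,!\,(42+1)]\ \max X.\Big([\mathit{Slv}\ \text{↯}]\,\mathsf{ff} \;\wedge\; [\mathit{Slv}\,?\,\mathit{Req}]\big([\mathit{Slv}\ \text{↯}]\,\mathsf{ff} \;\wedge\; [\mathit{Slv}\,!\,(\mathit{Req}+1)]\,X\big)\Big)$$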

The RV tool under scrutiny produces inlined monitor code that executes in the same process space of system components (see fig. 6a), yielding the lowest possible amount of runtime overhead. This enables us to scale our benchmarks to considerably high loads. Our experiments focus on correctness properties that are *parametric* w.r.t. system components [7,19,54,48]: with this approach, monitors need not interact with one another and can reach verdicts independently. Verdicts are communicated by monitors to a central entity that records the expected number of verdicts in order to determine when the experiment can be stopped. The set of properties used in our benchmarks translate to monitors that loop continually to exert the maximum level of runtime overhead possible.

Fig. 6b shows the monitor synthesised from property ϕ<sub>s</sub>, consisting of states Q<sub>0</sub>, Q<sub>1</sub>, the rejection state ✗, and inconclusive state ?. The rejection state corresponds to a *violation* of the property, *i.e.,* ff, whereas the *inconclusive* state is reached when the analysed trace events do not contain enough information to enable the monitor to transition to any other state. Both of these states are sinks, modelling the irrevocability of verdicts [24,26]. The modality [\*Slv* ?\*Req*] in property ϕ<sub>s</sub> corresponds to the transition between Q<sub>0</sub> and Q<sub>1</sub> in fig. 6b. The monitor follows this transition when it analyses the trace event pid<sub>1</sub>?d<sub>1</sub> exhibited by the slave with PID pid<sub>1</sub> when it receives data payload d<sub>1</sub> from the master; as a side effect, the transition binds the variable *Slv* to pid<sub>1</sub> and *Req* to d<sub>1</sub> in state Q<sub>1</sub>. From Q<sub>1</sub>, the monitor transitions to Q<sub>0</sub> only when the event pid<sub>1</sub>!d<sub>2</sub> is analysed, where d<sub>2</sub> = d<sub>1</sub> + 1 and pid<sub>1</sub> is the slave PID (previously) bound to *Slv*. From Q<sub>0</sub> and Q<sub>1</sub>, the rejection state ✗ can be reached when a crash event is analysed. In the case of Q<sub>0</sub>, the transition to ✗ is followed for *any* crash event \_ (the wildcard \_ denotes the *anonymous* variable). By contrast, the monitor reaches ✗ from Q<sub>1</sub> *only* when the slave with PID pid<sub>1</sub> crashes, otherwise it transitions to the inconclusive state ?. Other transitions from Q<sub>0</sub> and Q<sub>1</sub> leading to ? follow a similar reasoning. Interested readers are encouraged to consult [25,6,5] for more information on the specification logic and monitor synthesis.

(a) Inlined runtime monitors

(b) Synthesised monitor from property ϕ<sub>s</sub>

Fig. 6: Synthesised monitors instrumented with master and slave processes
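The transitions just described can be rendered as a small state machine. The Python sketch below is a hand-written illustration, not the code synthesised by the RV tool: the tuple encoding of events (`("recv", pid, payload)`, `("send", pid, payload)`, `("crash", pid)`) and the state names are assumptions made for the example.

```python
REJECT = "REJECT"              # stands for the rejection state
INCONCLUSIVE = "INCONCLUSIVE"  # stands for the inconclusive state


def step(state, event):
    """Advance the monitor by one trace event and return the new state.

    States are "Q0", the tuple ("Q1", slv, req) carrying the bound
    variables, or one of the two sink verdicts.
    """
    if state in (REJECT, INCONCLUSIVE):
        return state  # verdicts are irrevocable
    kind, pid, *payload = event
    if state == "Q0":
        if kind == "crash":
            return REJECT  # any crash event is rejected from Q0
        if kind == "recv":
            return ("Q1", pid, payload[0])  # binds Slv and Req
        return INCONCLUSIVE
    _, slv, req = state
    if kind == "crash":
        # Only a crash of the slave bound to Slv violates the property.
        return REJECT if pid == slv else INCONCLUSIVE
    if kind == "send" and pid == slv and payload[0] == req + 1:
        return "Q0"  # the reply carries the successor: recurse
    return INCONCLUSIVE


def monitor(trace):
    """Run the monitor over a finite trace, starting from Q0."""
    state = "Q0"
    for event in trace:
        state = step(state, event)
    return state
```

For instance, the trace `[("recv", "p1", 42), ("send", "p1", 43)]` returns the monitor to `"Q0"`, whereas replacing the second event with `("crash", "p1")` leads to the rejection verdict.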

**Synthetic Benchmarks.** We set the total number of slaves to n=20k for *moderate* loads and n=500k for *high* loads; Pr(send)=Pr(recv) is fixed at 0.9 as in sec. 3.1. These configurations generate ≈n×w×(work requests and responses)=4M and 100M messages respectively to produce 8M and 200M analysable trace events per run. The pseudorandom number generator is seeded with a constant value and three experiment repetitions are performed for the Steady, Pulse and Burst load profiles (see fig. 3). A loading time of t=100s is used. Our results are summarised in figs. 7 and 8. Each chart in these figures plots the particular performance metric (*e.g.* memory consumption) for the system without monitors, *i.e.,* the *baseline*, together with the overhead induced by the RV monitors.

*Moderate loads.* Fig. 7 shows the plots for the system set with n = 20k. These loads are similar to those employed by the state-of-the-art frameworks to evaluate component-based runtime monitoring, *e.g.* [57,7,10,23,48] (ours are slightly higher). We remark that none of the benchmarks used in these works consider different load profiles: they either model load on a Poisson process, or fail to specify the kind of load used. In fig. 7, the execution duration chart (bottom right) shows that, regardless of the load profile used, the running time of each experiment is comparable to the baseline. With the moderate size of 20k slaves, the execution duration on its own does not give a detailed enough view of runtime overhead, despite the fact that our benchmarks provide a broad coverage in terms of the Steady, Pulse and Burst load profiles. This trend is mirrored in the scheduler utilisation plot (top left), where both baseline and monitored system induce a constant load of ≈ 17.5%. On this account, we deem these results to be *inconclusive*. By contrast, our three load profiles induce different overhead for the RT (bottom left), and, to a lesser extent, the memory consumption plots (top right). Specifically, when the system is subjected to a Burst load, it exhibits a surge in the RT for the baseline and monitored system alike, at ≈ 16k slaves. While this is not reflected in the consumption of memory, the Burst plots do exhibit a larger, albeit linear, rate of increase in memory when compared to their Steady and Pulse counterparts. The latter two plots once again show analogous trends, indicating that both Steady and Pulse loads exact similar memory requirements and exhibit comparable responsiveness under the respectable load of 20k slaves. Crucially, the data plots in fig. 7 *do not* enable us to confidently extrapolate our results. The edge case in the RT chart for Burst plots raises the question of whether the surge in the trend observed at ≈16k remains consistent when the number of slaves goes beyond 20k. Similarly, although for a different reason, the execution duration plots do not allow us to distinguish between the overhead induced by monitors for different loads on this small scale; this occurs due to the *perturbations* introduced by the underlying OS (*e.g.* scheduling other processes, IO, *etc.*) that affect the sensitive time keeping of benchmarks.

Fig. 7: Mean runtime overhead for master and slave processes (20 k slaves)

*High loads.* We increase the load to n = 500k slaves to determine whether our benchmark set-up can adequately scale, and show how the monitored system performs under stress. The RT chart in fig. 8 indicates that for Burst loads (bottom left), the overhead induced by monitors *grows linearly* in the number of slaves. This contradicts the results in fig. 7, confirming our supposition that moderate loads may provide scant empirical evidence to extrapolate to general conclusions. However, the memory consumption for Burst loads (top right) exhibits similar trends to the ones in fig. 7. Subjecting the system to high loads renders discernible the discrepancy between the RT and memory consumption gradients for the Steady and Pulse plots that appeared to be similar under the moderate loads of 20k slaves. Considering the execution duration chart (bottom right of fig. 8) as the *sole* indicator of overhead could *deceivingly suggest* that runtime monitoring induces virtually identical overhead for the distinct load profiles of fig. 3. However, this erroneous observation is easily refuted by the memory consumption and RT plots that show otherwise. This stresses the merit of gathering multi-faceted metrics to assist in the interpretation of runtime overhead.

We extend the argument for multi-faceted views to the scheduler utilisation metric in fig. 8 that reveals a subtle aspect of our concurrent set-up. Specifically, the charts show that while the execution duration, RT and memory consumption plots grow in the number of slave processes, scheduler utilisation stabilises at ≈ 22.7%. This is partly caused by the master-slave design that becomes susceptible to bottlenecks when the master is overloaded with requests [61]. In addition, the preemptive scheduling of the EVM [16] ensures that the master *shares* the computational resources of the same machine with the rest of the slaves. We conjecture that, in a distributed set-up where the master resides on a *dedicated* node, the overall system throughput may be further pushed. Fig. 8 also attests to the utility of having a benchmarking framework that scales considerably well to increase the chances of detecting potential trends. For instance, the evidence gathered earlier in fig. 7 could have misled one to assert that the RV tool under scrutiny scales poorly under Burst loads of moderate and larger sizes.

Fig. 8: Mean runtime overhead for master and slave processes (500 k slaves)

**An OTS Application Use Case.** We evaluate the overheads induced by the RV tool under scrutiny using a third-party industry-strength web server called Cowboy [33], and show that the conclusions we draw are *in line* with those reported earlier for our synthetic benchmark results. Cowboy is written in Erlang and built on top of Ranch [34], a socket acceptor pool for TCP protocols that can be used to develop custom network applications. Cowboy relies on Ranch to manage its socket connections, but delegates HTTP client requests to *protocol handlers* that are forked dynamically by the web server to handle each request independently. This architecture follows closely our master-slave set-up of sec. 2.1 which abstracts details such as TCP connection management and HTTP protocol parsing. We generate load on Cowboy using the popular stress testing tool JMeter [3] to issue HTTP requests from a dedicated machine residing on the same network where Cowboy is hosted. The latter machine is the one used in the experiments discussed earlier. To emulate the typical behaviour of web clients (*e.g.* browsers) that fetch resources via multiple HTTP requests, our Cowboy application serves files of various sizes that are randomly accessed by JMeter during the benchmark. In our experiments, we monitored fragments of the Cowboy and Ranch communication protocol used to handle client requests.

Fig. 9: Mean overhead for synthetic and Cowboy benchmarks (20 k threads)

*Moderate loads.* Fig. 9 plots our results for *Steady* loads from fig. 7, together with the ones obtained from the Cowboy benchmarks; JMeter did not enable us to reproduce the Pulse and Burst load profiles. For our Cowboy benchmarks, we fixed the total number of JMeter request threads to 20k over the span of 100s, where each thread issued 100 HTTP requests. This configuration coincides with parameter settings used in the experiments of fig. 7. In fig. 9, the scheduler utilisation, memory consumption and RT charts (top, bottom left) show a correspondence between the baseline plots of our synthetic benchmarks and those taken with Cowboy and JMeter. This indicates that, for these metrics, our synthetic system model exhibits *analogous characteristics* to the ones of the OTS system, under the chosen load profile. The argument can be extended to the monitored versions of these systems which follow identical trends. We point out the similarity in the RT trends of our synthetic and Cowboy benchmarks, despite the fact that the latter set of experiments were conducted over a local network. This suggests that, for our single-machine configuration, the synthetic master-slave benchmarks manage to adequately capture local network conditions. The gaps separating the plots of the two experiment set-ups stem from the implementation specifics of Cowboy and our synthetic model. This discrepancy in measurements also depends on the method used to gather runtime metrics, *e.g.* JMeter cannot sample the EVM directly, and measures CPU as opposed to scheduler utilisation. The deviation in execution duration plots (bottom right) arises for the same reason.

*High loads.* Our efforts to run tests with 500k request threads were stymied by the scalability issues we experienced with Cowboy and JMeter on our set-up.

# 4 Conclusion

Concurrent RV necessitates benchmarking tools that can *scale dynamically* to accommodate considerable load sizes, and are able to provide a *multi-faceted view* of runtime overhead. This paper presents a benchmarking tool that fulfils these requirements. We demonstrate its implementability in Erlang, arguing that the design is easily instantiatable to other actor frameworks such as Akka and Thespian. Our set-up emulates various system models through configurable parameters, and scales to reveal behaviour that emerges only when software is pushed to its limit. The benchmark harness gathers different performance metrics, offering a multi-faceted view of runtime overhead that, to the best of our knowledge, other state-of-the-art tools do not currently offer. Our experiments demonstrate that these metrics benefit the interpretation of empirical measurements: they increase visibility that may spare one from drawing insufficiently general, or otherwise erroneous, conclusions. We establish that, despite its synthetic nature, our master-slave model faithfully approximates the mean response times observed in realistic web server traffic. We also compare the results of our synthetic benchmarks against those obtained from a real-world use case to confirm that our tool captures the behaviour of this realistic set-up. It is worth noting that, while our empirical measurements of secs. 3.1 and 3.2 depend on the implementation language, our conclusions are transferable to other frameworks, *e.g.* Akka and Play [42].

*Related work.* There are other less popular benchmarks targeting the JVM besides those mentioned in sec. 1. Renaissance [52] employs workloads that leverage the concurrency primitives of the JVM, focussing on the performance of compiler optimisations similar to DaCapo and ScalaBench. These benchmarks gather metrics that measure software quality and complexity, as opposed to metrics that gauge runtime overhead. The CRV suite [8] aims to standardise the evaluation of RV tools, and mainly focusses on RV for monolithic programs. We are unaware of RV-centric benchmarks for concurrent systems such as ours. In [43], the authors propose a queueing model to analyse web server traffic, and develop a benchmarking tool to validate it. Their model coincides with our master-slave set-up, and considers loads based on a Poisson process. A study of message-passing communication on parallel computers conducted in [31] uses systems loaded with different numbers of processes; this is similar to our approach. Importantly, we were able to confirm the findings reported in [43] and [31] (sec. 3.1).

#### References




# **Certified Abstract Cost Analysis**

Elvira Albert<sup>1,2</sup>, Reiner Hähnle<sup>3</sup>, Alicia Merayo<sup>2</sup>, and Dominic Steinhöfel<sup>3,4</sup>

<sup>1</sup> Instituto de Tecnología del Conocimiento, Madrid, Spain

<sup>2</sup> Complutense University of Madrid, Madrid, Spain (amerayo@ucm.es)

<sup>3</sup> Technische Universität Darmstadt, Darmstadt, Germany

<sup>4</sup> CISPA Helmholtz Center for Information Security, Saarbrücken, Germany

**Abstract.** A program containing placeholders for unspecified statements or expressions is called an abstract (or schematic) program. Placeholder symbols occur naturally in program transformation rules, as used in refactoring, compilation, optimization, or parallelization. We present a generalization of automated cost analysis that can handle abstract programs and, hence, can analyze the impact on the cost of program transformations. This kind of relational property requires provably precise cost bounds which are not always produced by cost analysis. Therefore, we certify by deductive verification that the inferred abstract cost bounds are correct and sufficiently precise. It is the first approach solving this problem. Both, abstract cost analysis and certification, are based on quantitative abstract execution (QAE) which in turn is a variation of abstract execution, a recently developed symbolic execution technique for abstract programs. To realize QAE the new concept of a cost invariant is introduced. QAE is implemented and runs fully automatically on a benchmark set consisting of representative optimization rules.

#### **1 Introduction**

We present a generalization of automated cost analysis that can handle programs containing placeholders for unspecified statements. Consider the program Q ≡ "i = 0; **while** (i < t) {P; i++;}", where P is any statement not modifying i or t. We call P an *abstract statement*; a program like Q containing abstract statements is called an *abstract program*. The (exact or upper bound) cost of executing P is described by a function ac<sub>P</sub>(x) depending on the variables x occurring in P. We call this function the *abstract cost* of P. Assuming that executing any statement has unit cost and that t ≥ 0, one can compute the (abstract) cost of Q as 2 + t · (ac<sub>P</sub>(x) + 2), depending on ac<sub>P</sub> and t. For any concrete instance of P, we can derive its concrete cost as usual and then obtain the concrete cost of Q simply by instantiating ac<sub>P</sub>. In this paper, we define and implement an abstract cost analysis to infer abstract cost bounds. Our implementation consists of an automatic abstract cost analysis tool and an automatic certifier for the correctness of inferred abstract bounds. Both steps are performed with an approach called *Quantitative Abstract Execution* (QAE).
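The closed form 2 + t · (ac<sub>P</sub>(x) + 2) can be checked against a direct step count. The following is only a sketch under stated assumptions: P is modelled solely by a hypothetical cost function `acP` over one input `x`, every assignment and every guard evaluation costs one unit, and the class and method names are illustrative and not part of the tool described in this paper.

```java
import java.util.function.IntUnaryOperator;

/** Unit-cost step counter for Q = "i = 0; while (i < t) { P; i++; }". */
final class AbstractCostOfQ {

    /**
     * Simulates Q and counts one step per assignment and per guard evaluation.
     * P is modelled only by its cost: acP maps the value of a (hypothetical)
     * input variable x to the number of steps one execution of P takes.
     */
    static int countSteps(int t, int x, IntUnaryOperator acP) {
        int steps = 1;                   // i = 0;
        int i = 0;
        while (true) {
            steps++;                     // one evaluation of the guard i < t
            if (!(i < t)) {
                break;
            }
            steps += acP.applyAsInt(x);  // P; (assumed not to modify i, t, or x)
            i++;
            steps++;                     // i++;
        }
        return steps;
    }

    /** Closed form from the text, valid for t >= 0. */
    static int closedForm(int t, int x, IntUnaryOperator acP) {
        return 2 + t * (acP.applyAsInt(x) + 2);
    }

    public static void main(String[] args) {
        IntUnaryOperator unit = x -> 1;            // P is a single assignment
        IntUnaryOperator linear = x -> 3 * x + 2;  // P is itself a loop over x
        for (int t = 0; t <= 3; t++) {
            System.out.println("t=" + t
                + " unit: " + countSteps(t, 4, unit) + " vs " + closedForm(t, 4, unit)
                + " linear: " + countSteps(t, 4, linear) + " vs " + closedForm(t, 4, linear));
        }
    }
}
```

Instantiating `acP` with a different function changes both the counted and the predicted cost in the same way, which is the sense in which the cost of Q is parametric in ac<sub>P</sub>.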

Fine, but what is this good for? Abstract programs occur in program transformation rules used in compilation, optimization, parallelization, refactoring, etc.: Transformations are specified as rules over *program schemata*, which are nothing but abstract programs. If we can perform cost analysis of abstract programs, we can *analyze the cost effect of program transformations*. Our approach is the *first method to analyze the cost impact of program transformations*.

© The Author(s) 2021

E. Guerra and M. Stoelinga (Eds.): FASE 2021, LNCS 12649, pp. 24–45, 2021.

https://doi.org/10.1007/978-3-030-71500-7_2

*Automated Cost Analysis.* Cost analysis occupies an interesting middle ground between termination checking and full functional verification in the static program analysis portfolio. The main problem in functional verification is that one has to come up with a functional specification of the intended behavior, as well as with auxiliary specifications including loop invariants and contracts [21]. In contrast, termination is a generic property and it is sufficient to come up with a suitable term order or ranking function [6]. For many programs, termination analysis is vastly easier to automate than verification.<sup>1</sup>

Computation cost is not a generic property, but it is usually schematic: One fixes a class of cost functions (for example, polynomial) that can be handled. A cost analysis then must come up with parameters (degree, coefficients) that constitute a valid bound (lower, upper, exact) for all inputs of a given program with respect to a cost model (# of instructions, allocated memory, etc.). If this is performed bottom up with respect to a program's call graph, it is possible to *infer* a cost bound for the top-level function of a program. Such a cost expression is often *symbolic*, because it depends on the program's input parameters.

A central technique for inferring symbolic cost of a piece of code with high precision is *symbolic execution* (SE) [9, 25]. The main difficulty is to render SE of loops with symbolic bounds finite. This is achieved with *loop invariants* that generalize the behavior of a loop body: an invariant is valid at the loop head after arbitrarily many iterations. Inferring sufficiently strong invariants automatically is generally an unsolved problem in functional verification, but it is much easier in the context of cost analysis, because invariants do not need to characterize functional behavior: it suffices that they permit inferring schematic cost expressions.

*Abstract Execution.* To infer the cost of program transformation *schemata* requires the capability of analyzing abstract programs. *This is not possible with standard SE*, because abstract statements have no operational semantics. One way to reason about abstract programs is to perform structural induction over the syntactic definition of statements and expressions whenever an abstract symbol is encountered. Structural induction is done in interactive theorem proving [7, 31] to verify, e.g., compilers. It is labor-intensive and not automatic. Instead, here we perform cost analysis of abstract programs via a recent generalization of SE called abstract execution (AE) [37, 38]. The idea of AE is, quite simply, to symbolically execute a program containing abstract placeholder symbols for expressions and statements, just as if it were a concrete program. It might seem counterintuitive that this is possible: after all, nothing is known about an abstract symbol. But this is not quite true: one can equip an abstract symbol with an *abstract* description of the behavior of its instances: a set of memory locations its behavior may depend on, commonly called *footprint*, and a (possibly different) set of memory locations it can change, commonly called *frame* [21].

<sup>1</sup> In theory, of course, proving termination is as difficult as functional verification. It is hard to imagine, for example, to find a termination argument for the Collatz function without a deep understanding of what it does. But automated termination checking works very well for many programs in practice.

*Cost Invariants.* In automated cost analysis, one often infers cost bounds from loop invariants, ranking functions, and size relations computed during SE [3, 11, 16, 40]. For *abstract* programs, we need a more general concept, namely a loop invariant expressing a *valid abstract cost bound* at the beginning of any iteration (e.g., 2 + i · (ac<sub>P</sub>(x) + 2) for the program Q above). We call this a *cost invariant*. This is an important technical innovation of this paper, increasing the modularity of cost analysis, because each loop can be verified and certified separately.

*Relational Cost Analysis.* AE allows specifying and verifying *relational* program properties [37], because one can express rule schemata. This extends to QAE and makes it possible, for the first time, to infer and to prove (automatically!), for example, the impact of program transformation on performance.

*Certification.* Cost annotations inferred by abstract cost analysis, i.e., cost invariants and abstract cost bounds, are automatically *certified* by a deductive verification system, extending the approach reported in [4] to abstract cost and abstract programs. This is possible because the specification (i.e., the cost bound) and the loop (cost) invariants are inferred by the cost analyzer—the verification system does not need to generate them.

To argue correctness of an abstract cost analysis is complex, because it must be valid for an infinite set of concrete programs. For this reason alone, it is useful to certify the abstract cost inferred for a given abstract program: during development of the abstract cost analysis reported here, several errors in abstract cost computation were detected—analysis of the failed verification attempt gave immediate feedback on the cause. We built a test suite of problems so that any change in the cost analyzer can be validated in the future.

Certification is crucial for the correctness of quantitative relational properties: The inferred cost invariants might not be precise enough to establish, e.g., that a program transformation does not increase cost for any possible program instance and run. This is only established at the certification stage, where relational properties are formally verified. *A relational setting requires provably precise cost bounds.* This feature is not offered by existing cost analysis methods.

# **2 QAE by Example**

We introduce our approach and terminology informally by means of a motivating example: *Code Motion* [1] is a compiler optimization technique moving a statement not affected by a loop from the beginning of the loop body to before the loop. This code transformation should preserve behavior provided the loop is executed at least once, but can be expected to improve computation effort, i.e., *quantitative* properties of the program, such as execution time and memory consumption: The moved code block is executed just once in the transformed context, leading to fewer instructions (less energy consumed) and, in case it allocates memory, less memory usage. In the following we subsume any quantitative aspect of a program under the term *cost* expressed in an unspecified *cost model* with the understanding that it can be instantiated to specific cost measures, such as number of instructions, number of allocated bytes, energy consumed, etc.

Fig. 1: Motivating example on relational quantitative properties.

To formalize code motion as a transformation rule, we describe in- and output of the transformation *schematically*. Fig. 1 depicts such a schema in a language based on Java. An *Abstract Statement* (AS) with identifier *Id*, declared as "\**abstract statement** *Id*;", represents an arbitrary concrete statement. It is obviously unsafe to extract arbitrary, possibly non-invariant, code blocks from loops. For this reason, the AS P in question has a *specification* restricting the allowed behavior of its instances. For compatibility with Java we base our specification language on the *Java Modeling Language* (JML) [27]. Specifications are attached to code via structured comments that are marked as JML by an "@" symbol. JML keyword "**assignable**" defines the memory locations that may occur in the frame of an AS; similarly, "**accessible**" restricts the footprint. Fig. 1 contains further keywords explained below.

Input to QAE is the abstract program to analyze, including annotations (highlighted in light gray in Fig. 1) that express restrictions on the permitted instances of ASs. In addition to the frame and footprint, the *cost footprint* of an AS, denoted with the keyword "**cost footprint**", is a subset of its footprint listing locations the cost expressions in AS instances may depend on. In Fig. 1, the cost footprint of AS Q excludes accessible variables i and y. Annotations highlighted in dark gray are *automatically inferred* by abstract cost analysis and are input for the certifier. As usual, loop invariants (keyword "**loop invariant**") are needed to describe the behavior of loops with symbolic bounds. The loop invariant in Fig. 1 allows inferring the final value t of loop counter i after loop termination. To prove termination, the loop *variant* (keyword "**decreases**") is inferred.

So far, this is standard automated cost analysis [3]. The ability to *infer automatically* the remaining annotations represents our main contribution: Each AS P has an associated *abstract cost* function parametric in the locations of its footprint, represented by an abstract cost symbol ac<sub>P</sub>. The symbol ac<sub>P</sub>(t, w) in the "**assert**" statement in Fig. 1 can be instantiated with any concrete function parametric in t, w that is a valid cost bound for the instance of P. For example, for the instantiation "P ≡ x=t+1;" the constant function ac<sub>P</sub>(t, w) = 1 is the correct *exact* cost, while ac<sub>P</sub>(t, w) = t with t ≥ 1 is a correct *upper bound* cost.

As pointed out in Sect. 1 we require *cost invariants* to capture the cost of each loop iteration. They are declared by the keyword "**cost invariant**". To generate them, it is necessary to infer the *cost growth* of abstract programs that bounds the number of loop iterations executed so far. In Sect. 4 we describe automated inference of cost invariants including the generation of cost growth for all loops. Our technique is compositional and also works in the presence of nested loops.

The QAE framework can express and prove quantitative relational properties. The assertions in the last lines in Fig. 1 use the expression \**cost** referring to the total accumulated cost of the program, i.e., the quantitative *postcondition*. We support quantitative relational postconditions such as \**cost 1** ≥ \**cost 2**, where \**cost 1**, \**cost 2** refer to the total cost of the original (on the left) and transformed (on the right) program, respectively. To prove relational properties, one must be able to deduce *exact* cost invariants for loops such that the comparison of the invariants allows concluding that the programs from which the invariants are obtained fulfill the proven relational property. Otherwise, over-approximation introduced by cost analysis could make the relation for the postconditions hold, while the relational property does not necessarily hold for the programs.
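A short derivation indicates why the relational postcondition can hold for code motion. It uses the exact cost expressions stated for the two programs in Sect. 3 (Example 1 and (†)) and assumes, beyond what the text states explicitly, that abstract costs are non-negative; the condition t ≥ 1 corresponds to the requirement that the loop executes at least once.

$$\begin{aligned}
\mathit{cost}_1 &= 2 + t \cdot \bigl(2 + \mathrm{ac}_P(t,w) + \mathrm{ac}_Q(t,z)\bigr) \\
\mathit{cost}_2 &= 2 + \mathrm{ac}_P(t,w) + t \cdot \bigl(\mathrm{ac}_Q(t,z) + 2\bigr) \\
\mathit{cost}_1 - \mathit{cost}_2 &= (t - 1) \cdot \mathrm{ac}_P(t,w) \;\geq\; 0 \quad \text{if } t \geq 1 \text{ and } \mathrm{ac}_P(t,w) \geq 0
\end{aligned}$$

The argument relies on both expressions being exact. If they were only upper bounds, an inequality between the bounds would say nothing about the actual costs of the two programs.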

To obtain a formal account of QAE with correctness guarantees we require a mathematically rigorous semantic foundation of abstract cost. This is provided in the following section.

# **3 (Quantitative) Abstract Execution**

Abstract Execution [37, 38] extends symbolic execution by permitting abstract statements to occur in programs. Thus AE reasons about an *infinite* set of concrete programs. An abstract program contains at least one AS. The semantics of an AS is given by the set of concrete programs it represents, its set of *legal instances*. To simplify presentation, we only consider normally completing Java code as instances: an instance may not throw an exception, break from a loop, etc. Each AS has an *identifier* and a specification consisting of its frame and footprint. Semantically, instances of an AS with identifier P may at most write to memory locations specified in P's frame and may only read the values of locations in its footprint. All occurrences of an AS with the *same identifier* symbol have the same legal instances (possibly modulo renaming of variables, if variable names in frame and footprint specifications differ). For example, by

//@ **assignable** x, y; //@ **accessible** y, z; \**abstract statement** P;

we declare an AS with identifier "P", which can be instantiated by programs that write at most to variables x and y, while only depending on variables y and z. The program "x=y; y=17;" is a legal instance of it, but not "x=y; y=w;", which accesses the value of variable w not contained in the footprint.
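The two instances above can be checked mechanically with a small sketch. It is a deliberately conservative, purely syntactic approximation for straight-line assignments only, not the semantic legality check of the AE framework; the class `FrameFootprintCheck` and its `Assignment` helper are hypothetical names introduced for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Conservative, purely syntactic frame/footprint check for straight-line code. */
final class FrameFootprintCheck {

    /** One assignment "target = e;", where e reads exactly the variables in 'reads'. */
    static final class Assignment {
        final String target;
        final Set<String> reads;

        Assignment(String target, String... reads) {
            this.target = target;
            this.reads = new HashSet<>(Arrays.asList(reads));
        }
    }

    /** True if every write is in the frame and every read is in the footprint. */
    static boolean respects(List<Assignment> program, Set<String> frame, Set<String> footprint) {
        for (Assignment a : program) {
            boolean writesOnlyFrame = frame.contains(a.target);
            boolean readsOnlyFootprint = footprint.containsAll(a.reads);
            if (!writesOnlyFrame || !readsOnlyFootprint) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Set<String> frame = Set.of("x", "y");      // assignable x, y
        Set<String> footprint = Set.of("y", "z");  // accessible y, z

        // "x=y; y=17;" writes x, y and reads only y
        List<Assignment> legal = List.of(new Assignment("x", "y"), new Assignment("y"));
        // "x=y; y=w;" reads w, which is not in the footprint
        List<Assignment> illegal = List.of(new Assignment("x", "y"), new Assignment("y", "w"));

        System.out.println("x=y; y=17; -> " + respects(legal, frame, footprint));
        System.out.println("x=y; y=w;  -> " + respects(illegal, frame, footprint));
    }
}
```

A syntactic check of this kind can reject programs that are semantically legal, for example one that reads a variable only after overwriting it; it is meant to illustrate the roles of frame and footprint, nothing more.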

We use the shorthand P(x, y :≈ y, z) for the AS declaration above. The left-hand side of ":≈" is the frame, the right-hand side the footprint. Abstract programs allow expressing a second-order property such as "all programs assigning at most x, y while reading at most y, z leave the value of i unchanged". In *Hoare triple* format (where i<sub>0</sub> is a fresh constant not occurring in P):

$$\{\mathtt{i} \doteq i_0\}\ \ \mathsf{P}(\mathtt{x}, \mathtt{y} :\approx \mathtt{y}, \mathtt{z});\ \ \{\mathtt{i} \doteq i_0\} \qquad (\ast)$$

#### **3.1 Abstract Execution with Abstract Cost**

We extend the AE framework [37, 38] to QAE by adding *cost specifications* that extend the specification of an AS with an annotated *cost expression*. An abstract cost expression is a function whose value may depend on any memory location in the footprint of the AS it specifies. This location set is called the *cost footprint*, specified via the **cost footprint** keyword (see Fig. 1), and must be a subset of the footprint of the specified AS. The cost footprint for the program in (∗) might be declared as "{*z*}". It implicitly declares the abstract function ac<sub>P</sub>(*z*) that could be instantiated to, say, quadratic cost "*z*<sup>2</sup>".

**Definition 1 (Abstract Program).** *A pair* P = (*abstrStmts*, p<sub>*abstr*</sub>) *of a set of AS declarations abstrStmts* ≠ ∅ *and a program fragment* p<sub>*abstr*</sub> *containing exactly those ASs is called an* abstract program*. Each AS declaration in abstrStmts is a pair* (P(*frame* :≈ *footprint*), ac<sub>P</sub>(*costFootprint*))*, where* P *is an identifier; frame, footprint, and costFootprint* ⊆ *footprint are location sets.*

*A concrete program fragment* p *is a* legal instance *of* P *if it arises from substituting concrete cost functions for all* ac<sub>P</sub> *in abstrStmts, and concrete statements for all* P *in abstrStmts, where (i) all ASs are instantiated legally, i.e., by statements respecting their frame, footprint, and cost function, and (ii) all ASs with the same identifier are instantiated with the same concrete program. The semantics* ⟦P⟧ *consists of all its legal instances.*

The abstract program consisting of only AS P in (∗) with cost footprint "{z}" is formally defined as ({(P(x, y :≈ y, z), ac<sub>P</sub>(z))}, P;). The program "P<sub>0</sub> ≡ i = 0; **while** (i < z) {x = z; i++;}" with cost function "ac<sub>P</sub>(z) = 3 · z + 2" is a legal instance: it respects frame, footprint, and cost footprint, as well as the cost function, which (assuming z ≥ 0) can be obtained by static cost analysis of P<sub>0</sub>.

By encoding the semantics of abstract programs in a program logic [38, Sect. 4.2] one can statically verify whether an instance is legal. It may require auxiliary specifications (invariants, contracts) of the concrete code. The property is undecidable, but can be proven automatically in many cases, see [38] for a discussion. A first implementation of such a check is part of the REFINITY tool (see [36], also https://www.key-project.org/REFINITY/).

#### **3.2 Cost of Abstract Programs**

Finitely executing a concrete program p starting in a state s<sub>0</sub> = (p, σ<sub>0</sub>) with an initial assignment σ<sub>0</sub> of p's program variables results in a finite trace of the form $t \equiv s_0 \xrightarrow{c_1} \cdots \xrightarrow{c_n} s_n$. Each state s<sub>i</sub> = (p<sub>i</sub>, σ<sub>i</sub>) consists of a program counter p<sub>i</sub> (the remaining program to execute) and a store σ<sub>i</sub> (the current variable assignment); each transition $s_i \xrightarrow{c_{i+1}} s_{i+1}$ updates s<sub>i</sub> to s<sub>i+1</sub> according to the effect of executing command c<sub>i+1</sub> defined in the semantics of the programming language. A *complete* trace corresponds to a terminating execution, i.e., s<sub>n</sub> = (ε, σ<sub>n</sub>), where ε is the empty program and σ<sub>n</sub> the resulting final variable assignment.

The cost of a program can be computed based on execution traces. To allow arbitrary quantitative properties, we work on a generic *cost model* M that assigns cost values to programming language instructions. We will compute the cost of a trace t, denoted M(t), by summing up the costs of the executed instructions. A straightforward measure is the number of executed instructions M<sub>instr</sub>: In this cost model, instructions like "x=1;", the evaluation of the loop guard, etc., all are assigned cost 1. For example, the cost of the complete trace of "**while** (i > 0) i−−;" when started with an initial store assigning the value 3 to i is 7, because "i−−;" is executed three times and the guard is evaluated four times. This can be generalized to *symbolic* execution: Executing the same program with a *symbolic* store assigning to i a symbolic initial value i<sub>0</sub> ≥ 0 produces traces of cost 2 · i<sub>0</sub> + 1. The cost of *abstract programs*, i.e., the generalization to QAE, is defined similarly: By generalizing not merely over all initial stores, but also over all concrete instances of the abstract program.
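The countdown example can be replayed with a few lines of Java. This is a minimal sketch of the cost model M<sub>instr</sub> for this one loop, with a hypothetical class name; it counts one unit per guard evaluation and one per decrement.

```java
/** Counts the transitions of "while (i > 0) i--;" under the unit cost model M_instr. */
final class CountdownCost {

    /** Returns the cost of the complete trace started with i = i0. */
    static int traceCost(int i0) {
        int cost = 0;
        int i = i0;
        while (true) {
            cost++;            // evaluation of the guard i > 0
            if (!(i > 0)) {
                break;
            }
            i--;
            cost++;            // i--;
        }
        return cost;
    }

    public static void main(String[] args) {
        System.out.println("cost for i = 3: " + traceCost(3));
        for (int i0 = 0; i0 <= 5; i0++) {
            // the symbolic cost 2 * i0 + 1 predicts the concrete cost for every i0 >= 0
            System.out.println(i0 + ": " + traceCost(i0) + " vs " + (2 * i0 + 1));
        }
    }
}
```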

**Definition 2 (Abstract Program Cost).** *Let* M *be a cost model. Let an integer-valued expression* c<sub>P</sub> *consist of scalar constants, program variables, and abstract cost symbols applied to constants and variables. Expression* c<sub>P</sub> *is the* cost of an abstract program P w.r.t. M *if for all concrete stores* σ *and instances* p ∈ ⟦P⟧ *such that* p *terminates with a complete trace* t *of cost* M(t) *when executed in* σ*,* c<sub>P</sub> *evaluates to* M(t) *when interpreting variables according to* σ*, and abstract cost functions according to the instantiation step leading to* p*. The instance of* c<sub>P</sub> *using the concrete store* σ *is denoted* c<sub>P</sub>(σ)*.*

*Example 1.* We test the cost assertion in the last lines of the left program in Fig. 1 by computing the cost of a trace obtained from a fixed initial store and instances of P, Q. We use the cost model M<sub>instr</sub> and an initial store that assigns 2 to t and 0 to all other variables. We instantiate P with "x=2∗t;" and Q with "y=i; y++;". Consequently, the abstract cost functions ac<sub>P</sub>(t, w) and ac<sub>Q</sub>(t, z) are instantiated with 1 and 2, respectively. Evaluating the postulated abstract program cost 2 + t · (2 + ac<sub>P</sub>(t, w) + ac<sub>Q</sub>(t, z)) for the concrete store and AS instantiations results in 2 + 2 · (2 + 1 + 2) = 12. Consequently, the execution trace should contain 12 transitions, which is the case.

### **3.3 Proving Quantitative Properties with QAE**

There are two ways to realize QAE on top of the existing functional verification layer provided by the AE framework [37, 38]: (i) provide a "cost" extension to the program logic and calculus underlying AE; (ii) translate non-functional (cost) properties to functional ones. We opt for the second, as it is less prone to introduce soundness issues stemming from the addition of new concepts to the existing framework. It is also faster to realize and allows early testing.

The translation consists of three elements: (a) A global "ghost" variable "cost" (representing keyword "\**cost**") for tracking accumulated cost; (b) explicit encoding of a chosen cost model by suitable ghost setter methods that update this variable; (c) functional loop invariants and method postconditions expressing cost invariants and cost postconditions.

Regarding item (c), we support three kinds of cost specification. These are, descending in the order of their strength: *exact*, *upper bound*, and *asymptotic* cost. At the analysis stage, it is usually impossible to determine the best match. For this reason, there is merely one **cost invariant** keyword, not three. However, when translating cost to functional properties, a decision has to be made. A natural strategy is to start with the strongest kind of specification, then proceed towards the weaker ones when a proof fails.

An exact cost invariant has the shape "cost == *expr*", an upper bound on the invariant cost is specified by "cost <= *expr*"; asymptotic cost is expressed by the idiom "asymptotic(cost ) <= asymptotic(*expr* )". The function "asymptotic" abstracts from constant symbols in the argument. For example, the (exact) cost postcondition of the abstract program on the right in Fig. 1 is:

$$\mathtt{cost} \mathrel{==} 2 + \mathrm{ac}_P(t, w) + t \cdot (\mathrm{ac}_Q(t, z) + 2) \qquad (\dagger)$$

Asymptotic cost would be expressed as asymptotic(cost) <= asymptotic(2 + ac<sub>P</sub>(t, w) + t · (ac<sub>Q</sub>(t, z) + 2)), where the right-hand side of the inequality is equivalent to asymptotic(ac<sub>P</sub>(t, w) + t · ac<sub>Q</sub>(t, z)).

Listing 2 shows the result of translating the cost invariant in Fig. 1 to a functional loop invariant (highlighted lines), using cost model M<sub>instr</sub> in ghost setters and postconditions of AS ("**ensures**" clauses). ASs P, Q must include the ghost variable "cost" in their frame, because they update its value. The keyword \**before** in the postcondition of an AS refers to the value a variable had just before executing the AS. In loops we use "inner" cost variables "iCost" tracking the cost inside the loop. When the loop terminates, we add the final value of "iCost" to "cost". After every evaluation of the guard of the loop, the cost is incremented accordingly. Using the translation in Listing 2 of the inferred annotations in Fig. 1, the AE system proves cost postcondition (†) automatically.
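The idea of the translation can be imitated in plain Java. The sketch below is not Listing 2 and does not use the AE system: it fixes the concrete instances of Example 1 (P as "x = 2 * t;", Q as "y = i; y++;"), assumes the program shapes described in the text (the exact statement order in Fig. 1 is an assumption), and replaces the deductive proof by runtime checks. The names `GhostCostTranslation`, `runOriginal`, `runTransformed`, and `check` are hypothetical.

```java
/** Ghost-variable encoding of unit cost for the two programs of the code-motion example. */
final class GhostCostTranslation {

    static final int AC_P = 1;  // cost of the instance "x = 2 * t;" of P
    static final int AC_Q = 2;  // cost of the instance "y = i; y++;" of Q

    static int cost;            // ghost variable standing in for \cost

    /** Assumed shape of the original program: i = 0; while (i < t) { P; Q; i++; } */
    static int runOriginal(int t) {
        cost = 0;
        int x = 0, y = 0;
        int i = 0;
        cost += 1;                                    // i = 0;
        int iCost = 0;                                // cost accumulated inside the loop
        while (true) {
            check(iCost == i * (AC_P + AC_Q + 2));    // cost invariant of this sketch
            iCost += 1;                               // guard evaluation
            if (!(i < t)) {
                break;
            }
            x = 2 * t;   iCost += AC_P;               // P
            y = i; y++;  iCost += AC_Q;               // Q
            i++;         iCost += 1;                  // i++;
        }
        cost += iCost;                                // add inner cost on loop exit
        return cost;
    }

    /** Assumed shape of the transformed program: i = 0; P; while (i < t) { Q; i++; } */
    static int runTransformed(int t) {
        cost = 0;
        int x = 0, y = 0;
        int i = 0;
        cost += 1;                                    // i = 0;
        x = 2 * t;
        cost += AC_P;                                 // P, executed once
        int iCost = 0;
        while (true) {
            check(iCost == i * (AC_Q + 2));           // cost invariant of this sketch
            iCost += 1;                               // guard evaluation
            if (!(i < t)) {
                break;
            }
            y = i; y++;  iCost += AC_Q;               // Q
            i++;         iCost += 1;                  // i++;
        }
        cost += iCost;
        return cost;
    }

    static void check(boolean condition) {
        if (!condition) {
            throw new IllegalStateException("cost invariant violated");
        }
    }

    public static void main(String[] args) {
        int t = 2;
        System.out.println("original:    " + runOriginal(t));     // Example 1 predicts 12
        System.out.println("transformed: " + runTransformed(t));  // (†) predicts 2 + 1 + 2 * (2 + 2)
    }
}
```

Runtime checks of this kind only test particular runs. The point of the translation in the paper is that the same conditions, stated as loop invariants and postconditions, are proven for all stores and all legal instances.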

Apart from the translation of inferred quantitative annotations to functional AE specifications, we implemented the axiomatization of the asymptotic function and extended the AE system's *proof script* language. This made it possible to define a highly automated proof strategy for non-linear arithmetic problems generated by some cost analysis benchmarks.

#### **4 Abstract Cost Analysis**

Recall from Sect. 2 that for automatic cost certification we need to infer annotations for abstract cost invariants and cost postconditions. To achieve this, we leverage a cost analysis framework for concrete programs to the abstract setting. The presentation is structured as follows: Sect. 4.1 defines the notion of an abstract cost relation system (ACRS) used in cost analysis for the abstract setting. Sect. 4.2 details how to generate automatically inductive cost invariants for abstract programs from ACRSs. Sect. 4.3 tells how to generate cost postconditions used to prove relational properties and required to handle nested loops.

Listing 2: Translation of cost model and cost invariants to AE.

#### **4.1 Inference of Abstract Cost Relations**

There are two main cost analysis approaches: those using recurrence equations in the style of Wegbreit [39], and those based on type systems [14, 24]. Our formalization is based on the first kind, but the main ideas for extending the framework to abstract programs would be also applicable to the second. The key issue when extending a recurrences-based framework to the abstract setting is the notion of *abstract cost relation* for loops which generalizes the concept of cost recurrence equations for a loop to an abstract setting. We start with notation for loops and technical details on assumed size relations.

*Loops.* In our formalization we consider while-loops containing n abstract statements and m non-abstract statements. Non-abstract statements include any concrete instruction of the target language (arithmetic instructions, conditionals, method calls, ...). We assume loops L have the general outline displayed below.

```
while (G) {
  //@ accessible r_{1,1}, ..., r_{1,h_{r1}}
  //@ assignable w_{1,1}, ..., w_{1,h_{w1}}
  //@ cost footprint c_{1,1}, ..., c_{1,h_{c1}}
  \abstract statement A_1;
  non abstract statement N_1;
  ...
}
```

Each abstract statement has a frame specification. Abstract and non-abstract statements may appear in any order, and either kind might be empty.

*Size relations.* We assume that for each loop sets of *size constraints* have been computed. These sets capture the size relation among the variables in the loop upon exit (called *base case*, denoted ϕ<sub>B</sub>), and when moving from one iteration to the next (denoted ϕ<sub>I</sub>). ASs are ignored by the size analysis. While this would be unsound in general, it will be correct under the requirements we impose in Def. 4 and with the handling of ASs in Def. 3. Size relations are available from any cost analyzer by means of a static analysis [13] that records the effect of concrete program statements on variables and propagates it through each loop iteration. In our examples, since we work on integer data, size analysis corresponds to a value analysis [10] tracking the value of the integer variables.<sup>2</sup>

*Example 2.* The size relations for the loop on the left in Fig. 1 are ϕ<sub>B</sub> = {i ≥ t} and ϕ<sub>I</sub> = {i < t, i′ = i + 1}. ϕ<sub>B</sub> is inferred from the loop guard and ϕ<sub>I</sub> from the guard and the increment of i (primed variables refer to the value of the variable after the loop execution).

Based on pre-computed size relations, we define the cost of executing a loop by means of an *abstract cost relation system* (ACRS). This is a set of cost equations characterizing the abstract cost of executing a loop for any input with respect to a given cost model M. Cost equations consist of a cost expression governed by size constraints containing applicability conditions for the equation (like i < t in ϕ<sub>I</sub> above) and size relations between loop variables (like i′ = i + 1 in ϕ<sub>I</sub>).

**Definition 3 (Abstract Cost Relation System).** *Let* L *be a loop as above with* n *abstract and* m *non-abstract statements. Let* x̄ *be the set of variables accessed in* L*. Let* ϕ<sub>I</sub>*,* ϕ<sub>B</sub> *be sound size relations for* L*, and* M *a cost model. The ACRS for* L *is defined as the following set of cost equations:*

$$\begin{aligned}
C(\overline{x}) &= \mathsf{C}_B, && \varphi_B \\
C(\overline{x}) &= \sum_{j=1}^{n} \mathsf{ac}_j\left(c_{j,1}, \dots, c_{j,h_{cj}}\right) + \sum_{i=1}^{m} \mathsf{C}_{N_i} + C(\overline{x}'), && \varphi_I
\end{aligned}$$

*where:*


Ignoring the abstract statements, one can apply a complete algorithm for cost relation systems [6] to an ACRS to obtain automatically a *linear*<sup>3</sup> ranking function f for loop L: f is a linear, non-negative function over x̄ that decreases strictly at every loop iteration. Function f yields directly the "//@ **decreases** f;" annotation required for QAE.

As in Sect. 3, the definition of ACRS assumes a generic cost model M and uses C to refer in a generic way to cost according to M. For example, to infer the number of executed steps, C is set to 1 per instruction, while for memory usage C records the amount of memory allocated by an instruction.

<sup>2</sup> For complex data structures, one would need heap analyses [35] to infer size relations.

<sup>3</sup> There exist (more expensive) algorithms to obtain also polynomial ranking functions [5] but for the sake of efficiency we are not using them in our system.

*General Case of ACRS.* The definition of ACRS was simplified for presentation. The following generalizations, not requiring any new concept, are possible: (1) We assume an ACRS for a loop has only two equations, one for the base case (the guard G does not hold) and one for the iterative case (G holds). In general, there might be more than one equation for the base case, e.g., if the guard involves multiple conditions and the cost varies depending on the condition that holds on the exit. Similarly, there might be multiple equations in the iterative case, e.g., if the loop body contains conditional statements and each iteration has different cost depending on the taken branch. This issue is orthogonal to the extension to abstract cost. (2) A loop might contain method calls that in turn contain ASs. In absence of recursion, such calls can be inlined. For recursive methods, it is possible to compute the call graph and solve the equations in reverse topological order such that the abstract cost of the (inner) method calls is obtained first and then inserted into the surrounding equations. (3) The cost of code fragments not part of any loop (before, after, and in between loops) is defined as well by abstract cost equations accumulating the cost of all instructions these fragments include, just as for concrete programs. This aspect does not require changes to the framework for concrete programs, so we do not formalize it, but just illustrate it in the next example.

*Example 3.* The ACRSs of the programs in Fig. 1 are (left program above line, right program below):

$$\begin{array}{rcll}
C_{\mathit{before}}(t, x, w, y, z) &=& c_{\mathit{before}} + C_{w_0}(i, t, x, w, y, z), & \{i = 0\} \\
C_{w_0}(i, t, x, w, y, z) &=& c_{B_{w_0}}, & \{i \geq t\} \\
C_{w_0}(i, t, x, w, y, z) &=& c_{w_0} + \mathrm{ac}_P(t, w) + \mathrm{ac}_Q(t, z) + C_{w_0}(i', t, \_, w, \_, z), & \{i' = i + 1,\ i < t\} \\
\hline
C_{\mathit{after}}(t, x, w, y, z) &=& c_{\mathit{after}} + \mathrm{ac}_P(t, w) + C_{w_1}(i, t, \_, w, y, z), & \{i = 0\} \\
C_{w_1}(i, t, x, w, y, z) &=& c_{B_{w_1}}, & \{i \geq t\} \\
C_{w_1}(i, t, x, w, y, z) &=& c_{w_1} + \mathrm{ac}_Q(t, z) + C_{w_1}(i', t, x, w, \_, z), & \{i' = i + 1,\ i < t\}
\end{array}$$

Notation c refers to the generic cost that can be instantiated to a chosen cost model M. Cost equation C<sub>before</sub> for the first program is composed of the cost c<sub>before</sub> of the instructions appearing before the loop plus the cost C<sub>w<sub>0</sub></sub> of executing the while loop. The size constraint fixes the initial value of i. Following Def. 3, there are two equations corresponding to the base case of the loop and executing one iteration, respectively. Observe that assignable variables in ASs have unknown values in the ACRS (according to item (6) in Def. 3). Program *after* has a similar structure. A ranking function for both loops is t − i, which is used to generate the annotation "//@ **decreases** t−i;" inserted just before each loop in Fig. 1.
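The cost equations of Example 3 can be evaluated by reading each equation as one case of a recursive function. The sketch below does this under assumptions that are not part of the example: the generic costs are instantiated to unit cost (one unit for "i = 0;", one for a failing guard, two for a successful guard plus "i++;"), and ac<sub>P</sub>, ac<sub>Q</sub> are passed as constants, which is justified only because their arguments t, w, z do not change inside the loops. All names are hypothetical.

```java
/** Evaluates the cost equations of Example 3 for a unit-cost instantiation. */
final class AcrsExample3 {

    // Assumed instantiation of the generic costs (unit cost per instruction).
    static final int C_BEFORE = 1;  // "i = 0;" in program before
    static final int C_AFTER = 1;   // "i = 0;" in program after
    static final int C_BASE = 1;    // guard evaluation that exits the loop
    static final int C_ITER = 2;    // successful guard evaluation plus "i++;"

    /** Equations for the loop of program before: base case {i >= t}, iteration {i' = i + 1, i < t}. */
    static int loopBefore(int i, int t, int acP, int acQ) {
        if (i >= t) {
            return C_BASE;
        }
        return C_ITER + acP + acQ + loopBefore(i + 1, t, acP, acQ);
    }

    /** Equation for program before, with size constraint {i = 0}. */
    static int before(int t, int acP, int acQ) {
        return C_BEFORE + loopBefore(0, t, acP, acQ);
    }

    /** Equations for the loop of program after: only Q remains in the body. */
    static int loopAfter(int i, int t, int acQ) {
        if (i >= t) {
            return C_BASE;
        }
        return C_ITER + acQ + loopAfter(i + 1, t, acQ);
    }

    /** Equation for program after: P is paid once, before the loop. */
    static int after(int t, int acP, int acQ) {
        return C_AFTER + acP + loopAfter(0, t, acQ);
    }

    public static void main(String[] args) {
        // Instances of Example 1: acP = 1, acQ = 2, t = 2
        System.out.println("before: " + before(2, 1, 2));
        System.out.println("after:  " + after(2, 1, 2));
    }
}
```

The variables marked "_" in the equations never appear as parameters here, which mirrors the observation that forgotten variables do not influence the evaluation.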

To guarantee soundness of abstract cost analysis, it is mandatory that (i) no AS in the loop modifies any of the variables that influence loop cost, i.e., the ASs do not *interfere with cost*, and (ii) the cost of the ASs in the loop is independent of the variables modified in the loop. We call the latter ASs *cost neutral*. The first requirement is guaranteed by item (6) in Def. 3, because the value of assignable variables is "forgotten" in the equations. It is implemented, as usual in static analysis, by using a name generator for *fresh* variables. If cost depends on assignable variables in an AS, then the ACRS will not be solvable (i.e., the analysis returns "unbound cost"). The ACRS in the example contains "_" in some equations, which prevents neither solvability of the system nor its evaluation, because the forgotten variables do not interfere with cost. However, if we had "forgotten" a cost-relevant variable (such as t), we would be unable to solve or evaluate the equations: without knowing t, the equation guard cannot be evaluated. Requirement (ii) is ensured by the following definition, which requires that variables in the cost footprint are not modified by other statements in the loop.

**Definition 4 (Cost neutral AS).** *Given a loop* L*, where*


L *is a loop with* cost neutral *ASs if, for all* A ∈ Abstr(L)*, it is the case that* (W(L) ∪ *Frame*(Abstr(L))) ∩ *CostFootprint*(*A*) = ∅*.*

The definition above constitutes a sufficient, but not necessary, criterion that could be tightened by a more expensive analysis. For instance, our framework easily extends to allow conditions in the cost footprint that the concretizations of the AS must fulfill. In our example, the cost footprint might include the condition i′ ≥ i, where i′ is the value of i after executing the AS. This permits the abstract statement to modify i provided it does not decrease its value. Thus, the AS is not cost neutral, but the upper bound remains sound. The formalization of this generalization is left to future work.

*Example 4.* It is easy to check that both loops in Fig. 1 have cost neutral ASs. On the left: W(L) = {i}, *Frame*({P, Q}) = {x, y}, *CostFootprint*(P) = {t,w}, and *CostFootprint*(Q) = {t, z}, so (W(L) ∪ *Frame*({P, Q})) ∩ *CostFootprint*(P) = ∅, and (W(L)∪*Frame*({P, Q}))∩*CostFootprint*(Q) = ∅. The program on the right is checked analogously.
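
The check in Example 4 is a plain set-disjointness test. The following minimal sketch replays it in Python; the function name `is_cost_neutral` and the dictionary encoding of the footprints are illustrative assumptions, not part of the framework.

```python
def is_cost_neutral(written, frame, footprints):
    """Check Def. 4: (W(L) ∪ Frame(Abstr(L))) ∩ CostFootprint(A) = ∅ for every AS A.

    written    -- W(L), as given in Example 4
    frame      -- Frame(Abstr(L)), the variables assignable by the ASs of the loop
    footprints -- maps each AS name to its cost footprint
    """
    modified = set(written) | set(frame)
    return all(modified.isdisjoint(fp) for fp in footprints.values())


# Left program of Fig. 1, using the sets listed in Example 4.
left_ok = is_cost_neutral(
    written={"i"},
    frame={"x", "y"},
    footprints={"P": {"t", "w"}, "Q": {"t", "z"}},
)
print(left_ok)
```

If the cost footprint of P contained i instead of t, the intersection would be non-empty and the check would fail.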

Given a program P with variables x̄ and an ACRS with initial equation C<sub>ini</sub>(x̄), we denote by eval(C<sub>ini</sub>(x̄), σ<sub>0</sub>) the evaluation of the ACRS for a given initial assignment σ<sub>0</sub> of the variables. This is a standard evaluation of recurrence equations performed by instantiating the right-hand side of the equations with the values of the variables in σ<sub>0</sub> and checking the satisfiability of the size constraints (if the expression being checked or accumulated contains "_", the evaluation returns "unbound"). As usual, the process is repeated until an equation without calls is reached.

*Example 5.* Consider the ACRS of the left program in Fig. 1 with variables (t, x, w, y, z), initial state σ<sub>0</sub> = (2, 0, 0, 0, 0), and cost model M<sub>instr</sub> (thus c<sub>before</sub>, c<sub>B<sub>w0</sub></sub> and c<sub>w<sub>0</sub></sub> take values 1, 1 and 2, respectively). The evaluation of the ACRS results in eval(C<sub>ini</sub>(t, x, w, y, z), (2, 0, 0, 0, 0)) = 6 + 2 · ac<sub>P</sub>(2, 0) + 2 · ac<sub>Q</sub>(2, 0).
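
The evaluation in Example 5 can be replayed mechanically. The sketch below unfolds the three equations of the left program iteratively instead of recursively and keeps the abstract cost terms symbolic by counting them. The function name `eval_left_acrs` and the tuple encoding of abstract cost terms are assumptions made for illustration only.

```python
from collections import Counter


def eval_left_acrs(t, w, z, c_before=1, c_base=1, c_iter=2):
    """Unfold the ACRS of the left program of Fig. 1 for concrete t, w, z.

    Returns the accumulated concrete cost and a Counter of abstract cost terms.
    The forgotten variables x and y never influence the result, so they are omitted.
    """
    concrete = c_before          # cost of the code before the loop
    abstract = Counter()
    i = 0                        # size constraint {i = 0} of the initial equation
    while i < t:                 # iterative equation, guard {i < t}
        concrete += c_iter
        abstract[("ac_P", t, w)] += 1
        abstract[("ac_Q", t, z)] += 1
        i += 1                   # size constraint {i' = i + 1}
    concrete += c_base           # base-case equation, guard {i >= t}
    return concrete, abstract


cost, terms = eval_left_acrs(t=2, w=0, z=0)
print(cost, dict(terms))
```

For t = 2 this yields the concrete part 6 and two occurrences each of ac_P(2, 0) and ac_Q(2, 0), in line with Example 5.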

The following theorem states soundness of the ACRS obtained by applying Def. 3 provided that all loops satisfy Def. 4.

**Theorem 1 (Soundness of ACRS).** *Let* M *be a cost model and* P *an abstract program whose loops satisfy Def. 4. Let* c<sup>P</sup> *be the abstract cost of* P *defined as in Definition 2. Let* Cini *be the initial equation for the ACRS obtained by Def. 3. For any initial state of the variables* <sup>σ</sup><sup>0</sup> <sup>∈</sup> <sup>Z</sup><sup>n</sup>m*, it holds that* c<sup>P</sup> (σ0) ≤ eval(Cini(x), σ0)*.*

#### **4.2 From ACRS to Abstract Cost Invariants**

Example 5 shows that ACRSs are evaluable for concrete instances. However, to enable automated QAE, we need to obtain from them *closed-form* cost invariants and postconditions, i.e., non-recursive expressions. We introduce the novel concept of *abstract cost invariant* (ACI) that enables automated, inductive proofs over cost in a deductive verification system. The crucial difference to (non-inductive) cost postconditions as inferred by existing cost analyzers is that ACIs can be proven inductively for each loop iteration. Hence, they integrate naturally into deductive verification systems that use loop invariants [21].

In contrast to ACIs, postconditions provide a bound for the cost *after* execution of the *whole* loop they refer to. Typically, a postcondition bound for a loop has the form max_iter ∗ max_cost + max_base, where max_iter is the maximal number of iterations of the loop, max_cost is the maximal cost of any loop iteration, and max_base is the maximal cost of executing the loop with no iterations. Instead, an ACI has the form *growth* ∗ max_cost + max_base, where *growth* counts how many times the loop has been executed and hence provides a bound after *each* loop iteration. The challenge is to design an automated technique that infers *growth*. We propose to obtain it from the ranking function:

**Definition 5 (Growth).** *Given a loop with ranking function*

$$F = c + \sum_{i} a_i \cdot v_i,$$

*where* c *and* v<sub>i</sub> *are the constant and variable parts of the function, respectively, and* a<sub>i</sub> *are constant coefficients. If we denote with* v<sub>i</sub><sup>0</sup> *the initial value of variable* v<sub>i</sub> *before entering the loop, then*

$$growth = \sum_{i} a_i \cdot \left(v_i^{0} - v_i\right).$$
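
As a quick sanity check of Def. 5, the following sketch computes growth for the ranking function t − i used for the loops of Fig. 1. The dictionary encoding of coefficients and stores is an assumption made for illustration.

```python
def growth(coeffs, initial, current):
    """growth = sum_i a_i * (v_i^0 - v_i) for a ranking function c + sum_i a_i * v_i.

    coeffs  -- maps each variable v_i to its coefficient a_i
    initial -- values v_i^0 before entering the loop
    current -- values v_i at the point where growth is evaluated
    """
    return sum(a * (initial[v] - current[v]) for v, a in coeffs.items())


# Ranking function t - i: a_t = 1, a_i = -1; the constant part c plays no role.
coeffs = {"t": 1, "i": -1}
initial = {"t": 5, "i": 0}
values = [growth(coeffs, initial, {"t": 5, "i": i}) for i in range(6)]
print(values)
```

Since t is not modified and i starts at 0, growth equals the current value of i, i.e., the number of completed iterations.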



We can now define the concept of ACI that relies on abstract cost relations defined in Sect. 4.1 and growth as defined above.

**Definition 6 (Abstract Cost Invariant).** *Given an ACRS as in Def. 3 and its growth as in Def. 5, an* abstract cost invariant *is defined as follows:*

$$c_{inv}(\overline{x}) = \mathsf{C}_{\mathsf{B}}^{max} + growth \cdot \left(\sum_{j=1}^{n} \mathsf{ac}_{j}\left(c_{j,1}, \ldots, c_{j,h_{cj}}\right) + \sum_{i=1}^{m} \mathsf{C}_{\mathsf{N}_i}^{max}\right)$$

*where* C<sub>B</sub><sup>max</sup> *stands for the maximal value that the expression* C<sub>B</sub> *can take under the constraints* ϕ<sub>B</sub>*, and* C<sub>N<sub>i</sub></sub><sup>max</sup> *the maximal value of* C<sub>N<sub>i</sub></sub> *under* ϕ<sub>I</sub>*. We generate the annotation "*//@ **cost_invariant** c<sub>inv</sub>(x̄);*".*

To obtain the maximal cost of a cost expression under a set of constraints, we use existing maximization procedures [5].

From Def. 6 we obtain ACIs as closed-form abstract cost expressions of the form abexpr = cexpr | ac | abexpr<sub>1</sub> + abexpr<sub>2</sub> | abexpr<sub>1</sub> ∗ abexpr<sub>2</sub>, where ac represents an abstract cost function as defined in Sect. 3.1 and cexpr is a concrete cost expression. The definition above yields linear bounds; however, the extension to infer postconditions in the subsequent section leads to polynomial expressions (of arbitrary degree).<sup>4</sup>

*Example 7 (Abstract Cost Invariant).* Consider the first loop in Example 6 (where *growth* = i) with the following frame and footprint:

//@ **assignable** j; **accessible** i, t, j, k; **cost_footprint** k;

Using Minstr, the evaluation of the loop guard and the increase of i both have unit cost, so the ACRS is:

$$\begin{array}{ll}C(\mathbf{i},\mathbf{t},\mathbf{j},\mathbf{k})=1, & \{\mathbf{i}\ge\mathbf{t}\}\\C(\mathbf{i},\mathbf{t},\mathbf{j},\mathbf{k})=\mathsf{ac}_{\mathsf{P}}(\mathbf{k})+2+C(\mathbf{i}',\mathbf{t},\_,\mathbf{k}), & \{\mathbf{i}'=\mathbf{i}+1,\ \mathbf{i}<\mathbf{t}\}\end{array}$$

The value of the assignable variable j in the recursive call is "forgotten" (item (6) in Def. 3), but this information loss does not affect solvability of the ACRS. We obtain the following ACI: "//@ **cost_invariant** 1 + i ∗ (2 + ac<sub>P</sub>(k));".

*Example 8 (Upper Bound Abstract Cost Invariant).* Sometimes an ACI is overapproximating cost, resulting in an *upper bound ACI*. To illustrate this, we add an instruction that creates an array of nonconstant size "i" to the program in Example 7 and measure memory consumption instead of instruction count.

$$\begin{array}{l}
\textbf{while}\ (\mathtt{i} < \mathtt{t})\ \{ \\
\quad \mathtt{a} = \textbf{new}\ \mathtt{int}[\mathtt{i}]; \\
\quad \texttt{//@ } \textbf{assignable}\ \mathtt{j}; \\
\quad \texttt{//@ } \textbf{accessible}\ \mathtt{i}, \mathtt{t}, \mathtt{j}, \mathtt{a}, \mathtt{k}; \\
\quad \texttt{//@ } \textbf{cost\_footprint}\ \mathtt{k}; \\
\quad \textbf{abstract\_statement}\ \mathtt{P}; \\
\quad \mathtt{i}\texttt{++}; \\
\}
\end{array}$$

The resulting ACRS thus accumulates cost "i" at each iteration, plus the memory consumed by the abstract statement:

$$\begin{array}{ll} C(\mathbf{i}, \mathbf{t}, \mathbf{j}, \mathbf{k}) = 0, & \{\mathbf{i} \ge \mathbf{t}\} \\ C(\mathbf{i}, \mathbf{t}, \mathbf{j}, \mathbf{k}) = \mathsf{ac}_{\mathsf{P}}(\mathbf{k}) + \mathbf{i} + C(\mathbf{i}', \mathbf{t}, \_, \mathbf{k}), & \{\mathbf{i}' = \mathbf{i} + 1,\ \mathbf{i} < \mathbf{t}\} \end{array}$$

Now, maximizing the expression C<sub>N<sub>1</sub></sub> = i under {i′ = i + 1, i < t} results in C<sub>N<sub>1</sub></sub><sup>max</sup> = t − 1 and the upper bound ACI "//@ **cost_invariant** i ∗ (t − 1 + ac<sub>P</sub>(k));".
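
To see the difference between the exact ACI of Example 7 and the upper bound ACI of Example 8, one can simulate both loops and compare the accumulated cost with the invariant after every iteration. This is only a sketch under stated assumptions: `check_aci` is a hypothetical helper, the loop counter is assumed to start at 0, and `AC_P` is an arbitrary concrete number standing in for the abstract cost ac_P(k).

```python
def check_aci(t, iter_cost, base_cost, aci):
    """Simulate `while (i < t) { ...; i++; }` and record (cost, invariant) pairs.

    iter_cost(i) -- cost of one iteration entered with counter value i
    base_cost    -- cost of leaving the loop (base case of the ACRS)
    aci(i)       -- candidate abstract cost invariant, as a function of i
    """
    cost, i = 0, 0
    pairs = [(cost, aci(i))]                  # on entering the loop
    while i < t:
        cost += iter_cost(i)
        i += 1
        pairs.append((cost, aci(i)))          # after each iteration
    pairs.append((cost + base_cost, aci(i)))  # after leaving the loop
    return pairs


AC_P = 7   # assumption: stand-in value for ac_P(k)
t = 5

# Example 7 (instruction count): 2 + ac_P(k) per iteration, base cost 1.
ex7 = check_aci(t, lambda i: 2 + AC_P, 1, lambda i: 1 + i * (2 + AC_P))
# Example 8 (memory): i + ac_P(k) per iteration, base cost 0.
ex8 = check_aci(t, lambda i: i + AC_P, 0, lambda i: i * (t - 1 + AC_P))

print(ex7[-1], ex8[-1])
```

In this run the final pair of Example 7 has equal components, whereas in Example 8 the invariant lies strictly above the accumulated cost, since every iteration is charged the maximal array size t − 1.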

Let c<sup>L</sup> denote the abstract cost of executing a loop L (in analogy to c<sup>P</sup> in Def. 2, but considering only loop L rather than the whole program P). We denote by c<sup>I</sup> the portion of the cost in c<sup>L</sup> up to the execution of iteration I.

**Proposition 1.** *Let* L *be a loop with variables* x *satisfying Def. 4,* cinv(x) *its ACI, and* <sup>σ</sup><sup>I</sup> <sup>∈</sup> <sup>Z</sup><sup>n</sup><sup>m</sup> *be the store after performing iteration* <sup>I</sup> *of* <sup>L</sup>*. Then the following holds: (1)* cinv(x) *is true on entering the loop; (2)* c<sup>I</sup> (σ<sup>I</sup> ) ≤ cinv(σ<sup>I</sup> )*.*

<sup>4</sup> As our approach is based on a recurrences-based framework [39] that works for exponential and logarithmic expressions, the results in this section generalize to these expressions. However, the AE deductive verification system is not able to deal with them automatically at the moment, so we skip these expressions in our account.

#### **4.3 From Cost Invariants to Postconditions**

To handle programs with nested loops and to prove relational properties, it is necessary to infer *cost postconditions* for abstract programs. For nested loops, the cost postcondition states the abstract cost after complete execution of the inner loop, and it is used to compute the invariant of the outer loop. For relational properties, the cost postconditions of two abstract programs are compared. Cost postconditions for concrete programs are obtained by upper bound solvers (e.g., COSTA [3], CoFloCo [16], AProVE [17]) that compute *max_iter*, an upper bound on the number of iterations that a loop performs. To do so, they rely on ranking functions. We do this as well, but generalize the computation of postconditions to abstract programs. The cost postcondition is obtained by replacing growth with max_iter in the formula of c<sub>inv</sub>(x̄) in Def. 6 as follows.

**Definition 7 (Cost Postcondition).** *Let* L *be a loop and max_iter be an upper bound on the number of iterations of* L*. Given the ACRS for* L *in Def. 3, we infer the cost postcondition for* L *as*

$$post(\overline{x}) = \mathsf{C}_{\mathsf{B}}^{max} + max\_iter(\overline{x}) \cdot \left(\sum_{j=1}^{n} \mathsf{ac}_{j}\left(c_{j,1}, \ldots, c_{j,h_{cj}}\right) + \sum_{i=1}^{m} \mathsf{C}_{\mathsf{N}_i}^{max}\right),$$
*and generate the annotation "*//@ **assert** …;*".*

To infer the postcondition for a complete abstract program, we take the sum of all *cost postconditions* of its top-level loops plus the cost of the non-iterative fragments. Fig. 1 shows the cost postconditions for our running example obtained by replacing the growth i of the invariant with the bound t on the loop iterations and requiring t ≥ 0. The generation of inductive ACIs for nested loops uses the cost postcondition of inner loops to compute the invariants of the outer ones. The following theorem states soundness of cost postconditions:

**Theorem 2.** *Let* L *be a loop over variables* x *satisfying Def. 4 and post*(x) *its cost postcondition. Let* <sup>σ</sup><sup>L</sup> <sup>∈</sup> <sup>Z</sup><sup>m</sup><sup>n</sup> *be the store upon termination of* <sup>L</sup>*. Then* cL(σL) ≤ *post*(σL)*.*
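
As a worked illustration of Def. 7, consider the loops of Examples 7 and 8. We assume here, as for Fig. 1, that the loop counter starts at 0 and that the bound on the number of iterations is t with t ≥ 0; these instantiations are our own reading and are not stated for these two examples in the text. Replacing growth = i by max_iter = t in the respective ACIs gives:

$$\begin{array}{lll}
\text{Example 7:} & c_{inv} = 1 + \mathtt{i} \cdot \left(2 + \mathsf{ac}_{\mathsf{P}}(\mathtt{k})\right) & \Longrightarrow \quad post = 1 + \mathtt{t} \cdot \left(2 + \mathsf{ac}_{\mathsf{P}}(\mathtt{k})\right) \\
\text{Example 8:} & c_{inv} = \mathtt{i} \cdot \left(\mathtt{t} - 1 + \mathsf{ac}_{\mathsf{P}}(\mathtt{k})\right) & \Longrightarrow \quad post = \mathtt{t} \cdot \left(\mathtt{t} - 1 + \mathsf{ac}_{\mathsf{P}}(\mathtt{k})\right)
\end{array}$$

The second postcondition is quadratic in t, which shows how the step from invariants to postconditions can lead from linear to polynomial expressions.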

# **5 Experimental Evaluation**

We implemented a prototype of our approach, downloadable from https://tinyurl.com/qae-impl (including required libraries). The archive contains the benchmarks of this section and additional examples as well as build and usage instructions. The prototype is a command-line implementation backed by an existing cost analysis library for (non-abstract) Java bytecode as well as the deductive verification system KeY [2] including the AE framework [37,38]. Our implementation consists of three components: (1) an extension of a cost analyzer (written in Python) to handle abstract Java programs, (2) a conversion tool (written in Java) translating the output of the analyzer to a set of input files for KeY, (3) a bash script orchestrating the whole tool chain, specifically, the interplay between item (1), item (2) and the two libraries. In case of a failed certification attempt, our script offers the choice to open the generated proof in KeY for further debugging. In total, our implementation (excluding the libraries) consists of 1,802 lines of Python, 703 lines of Java, and 389 lines of bash code (without blank lines and comments).

To assess effectiveness and efficiency of our approach, we used our QAE implementation to analyze seven typical code optimization rules using cost models Minstr (rows "1∗"–"6∗" in Table 1) and Mheap (rows "7∗"). While Minstr counts the number of instructions, Mheap measures heap consumption. The first column identifies the benchmark ("a" refers to the original program, "b" to the transformed one), the second **P** refers to the kind of proven cost result (asymptotic "a", exact "e", upper "u"), column three shows the inferred growth function for each loop in the program (separated by "," if there are two or more loops), in the fourth column we list the cost postcondition obtained by the analysis (expressions indicating the number of loop iterations are highlighted), and columns five to eight display performance metrics. Time tcost, given in milliseconds, is the time needed to perform the cost analysis. The proof generation time tproof is given in seconds. We also display the time tcheck needed for checking integrity of an already generated proof certificate. Finally, sproof is the size of the generated KeY proof in terms of number of proof steps. Even though the time needed for certification is significantly higher than for cost analysis (which is to be expected), each analysis can be performed within one minute. The time to *check* a proof certificate amounts to approximately one fourth to one third of the time needed to *generate* it. We stress that all analyses are *fully automatic*.

We briefly describe the nature of each experiment: **1** is a *loop unrolling* transformation duplicating the body of a loop: each copy of the body is put inside an **if**-statement conditioned by the loop guard. Here, we had to switch to *asymptotic* cost invariants: the cost analyzer over-approximates the number of iterations of the unrolled loop, since there are different possible control flows in the body. This was automatically detected by the certifier, which failed to find a proof when exact cost invariants were conjectured and succeeded with asymptotic ones. **2** is the *CodeMotion* example from Sect. 2. The result reflects the cost *decrease* in the sense that fewer instructions need to be executed by the transformed program. **3** implements a *LoopTiling* optimization at compiler level in which a single loop with n · m iterations is transformed into two nested loops, an outer one looping until n and an inner one until m. Since our cost analyzer only handles linear size expressions, the first program is written using an auxiliary parameter t that is then instantiated to value n · m. **4** is a *SplitLoop* transformation splitting a loop with two independent parts into two separate loops. We prove that this transformation does not affect the cost up to a constant factor. **5** is an optimization combining *two loops* with the same body structure into one loop. **6** is a *three loops* example, one nested and one simple. The optimization combines the bodies of the outer loop in the nested structure and the simple loop. **7** is an *array* optimization, where an array declaration is moved in front of a loop, initializing it with an auxiliary parameter that is the sum of all the initial sizes.


Table 1: Results of the experiments.

# **6 Related Work**

The present paper builds on the original AE framework [37,38], which we extend to *Quantitative* AE. At the moment no other approach or tool is able to analyze and certify the cost of schematic programs, specifically relational properties, so a direct comparison is impossible.

*Cost Analysis.* There are many resource analysis tools, including: [20], based on introducing counters and inferring loop invariants; [23], based on an analysis over the depth of functional programs formalized by means of type systems. Approaches that bound the number of execution steps include [19,29], working at the level of compilers. Systems such as AProVE [17] analyze the complexity of Java programs by transforming them to integer transition systems; COSTA [3] and CoFloCo [16] are based on the generation of cost recurrence equations from which upper bounds can be inferred. That is also the basis of the approach we pursue to infer abstract upper bounds in Sect. 4.1, hence our technique can be viewed as a generalization of these systems. Approaches based on type systems could also be generalized to work on abstract programs by introducing abstract cost as in Sect. 4.1.

For our work it is crucial to use ranking functions to infer growth of cost invariants. Ranking functions were used to generate bounds on the number of loop iterations in several systems, but none used them to define growth: [10] obtain runtime complexity bounds via symbolic representation from ranking functions, likewise PUBS [3], Loopus [40], and ABC [8]. PUBS analyses all loop transitions at once, Loopus uses an iterative procedure where bounds are propagated from inner to outer loops, ABC deals with nested, but not sequential loops. In our work, when inferring upper bounds, we solve all transitions at once and handle nested as well as sequential loops.

*Certification.* Several general-purpose deductive software verification [21] tools exist, including VeriFast [34], Why [15], Dafny [28], KIV [33], and KeY [2]. We use KeY, currently the only system to implement AE. *Interactive* proof assistants like Isabelle [31] or Coq [7] also support more or less expressive abstract program fragments, but lack full automation. There are dedicated approaches involving schematic programs for *specific* contexts, like regression verification [18], compilation [22, 26, 30] or derived symbolic execution rules [12].

Regarding the combination of deductive verification and cost analysis, the closest approach to ours is the integration of COSTA and KeY [4], which was realized for concrete, not abstract programs. They verify upper bounds on the cost of concrete programs by decomposing them into ranking functions and size relations which are then verified separately. Here we use the novel concept of cost invariant that allows verification of quantitative properties without decomposition. Paper [4] deals only with the global number of iterations as is common in worst-case cost analysis. Our cost invariants are designed to be inductive and propagate cost through all loop iterations. Radiček et al. [32] devise a formal framework for analyzing the relative cost of different programs (or the same program with different inputs). Compared to our approach, they target purely functional programs extended with monads representing cost, while we work with an industrial programming language. Moreover, we generally reason about the cost of *transformations*, not of a transformation applied to one *particular* program.

# **7 Conclusion and Future Work**

We presented the first approach to analyze the cost of schematic programs with placeholders. We can infer and verify cost bounds for a potentially infinite class of programs once and for all. In particular, for the first time, it is possible to analyze and prove changes in efficiency caused by program transformations—for all input programs. Our approach supports exact and asymptotic cost and a configurable cost model. We implemented a tool chain based on a cost analyzer and a program verifier which analyzes and formally certifies abstract cost bounds in a fully automated manner. Certification is essential, because only the verifier can determine whether the bounds inferred by the cost analyzer are exact.

Our work required the new concept of an (abstract) cost invariant. This is interesting in itself, because (i) it renders the analysis of nested loops modular and (ii) provides an interface to backends (such as verifiers) that characterizes the cost of code in iterations.

Obvious future work involves extending the analyzed target language. Cost analysis and deductive verification (including AE) are already possible for a large Java fragment [3, 37]. More interesting—and more challenging—is the analysis of program transformations that parallelize code. The extension to larger classes of cost functions, such as logarithmic or exponential, could be realized by integrating non-linear SMT solvers into the tool chain.

*Acknowledgments.* This work was funded partially by the Spanish MCIU, AEI and FEDER(EU) project RTI2018-094403-B-C31, by the CM project S2018/TCS-4314 cofunded by EIE Funds of the EU and by the UCM CT42/18-CT43/18 grant.

# **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Bootstrapping Automated Testing for RESTful Web Services**

Yixiong Chen<sup>1</sup>, Yang Yang<sup>1</sup>, Zhanyao Lei<sup>1</sup>, Mingyuan Xia<sup>2</sup>, and Zhengwei Qi<sup>1</sup>

<sup>1</sup> Shanghai Jiao Tong University, Shanghai, China {lawischen,ylxy452782520,leizhanyao,qizhwei}@sjtu.edu.cn <sup>2</sup> AppetizerIO, Shanghai, China ken@appetizer.io

**Abstract.** Modern RESTful services expose RESTful APIs to integrate with diversified applications. Most RESTful API parameters are weakly typed, which greatly increases the possible input value space. This makes it difficult for automated testing tools to generate effective test cases that reveal web service defects related to parameter validation. We call this phenomenon the type collapse problem. To remedy this problem, we introduce FET (Format-encoded Type) techniques, including the FET, the FET lattice, and the FET inference, to model fine-grained information for API parameters. Enhanced by FET techniques, automated testing tools can generate targeted test cases. We demonstrate Leif, a trace-driven fuzzing tool, as a proof-of-concept implementation of FET techniques. Experiment results on 27 commercial services show that FET inference precisely captures documented parameter definitions, which helps Leif to discover 11 new bugs and reduce fuzzing time by 72% to 86% compared to state-of-the-art fuzzers.

**Keywords:** Fuzz Testing · RESTful Web Service · Type Inference.

# **1 Introduction**

The REST (Representational State Transfer) architecture [28] now dominates the design of complex web services, such as public clouds (e.g. AWS and Azure), social networking (e.g. Facebook and Twitter), and code hosting (e.g. GitHub and GitLab). Typically, a RESTful web service exposes a set of RESTful APIs. A client requests an API providing parameter values, and the service responds with data represented in some common exchange format (e.g. JSON or XML). According to a recent survey of 40 real-world popular RESTful web services [36], modern services involve an average of 64 APIs and over 20 parameters per API. Testing such an input space of possible parameter value combinations is challenging, and therefore automated testing is indispensable.

Since RESTful APIs are intended for applications implemented in different programming languages, API parameters are weakly typed. An investigation of 27 RESTful web services [19] shows that over 67% of the parameters are string-typed, about 32% are number-typed, and the remaining 1% are boolean-typed or object-typed. Overusing primitive data types significantly increases the possible input value space. For example, a string-typed parameter can take values varying from a specific URL to a comment about a YouTube video. This makes it difficult to generate effective test cases. Consequently, many automated REST testing tools are ineffective while RESTful web services suffer from various input-related attacks, such as integer overflow attacks and SQL injection attacks [18]. We call this phenomenon the *type collapse problem*.

© The Author(s) 2021

E. Guerra and M. Stoelinga (Eds.): FASE 2021, LNCS 12649, pp. 46–66, 2021. https://doi.org/10.1007/978-3-030-71500-7_3

Our solution is to give automated testing tools a better understanding of parameters. We observe that although parameter types are weak, their values usually have distinct formats. For example, a datetime parameter may require an ISO8601 date string. This motivates us to introduce the *FET (Format-encoded Type)*, which combines *data types* and *value formats* to describe parameters at a fine granularity. For instance, the SHA1 FET represents string-typed parameters whose values consist of 40 hexadecimal digits. Furthermore, we introduce the FET lattice, which hierarchically organizes a set of FETs by a partial order, along with the FET inference, which seeks suitable FETs in a FET lattice for parameters in an unambiguous manner.

To demonstrate how FET techniques enhance automated REST testing, we implement Leif, a trace-driven fuzz testing tool. Leif gains fine-grained parameter information by performing FET inference on HTTP traffic and then mutates parameter values to mimic real attacks based on the inferred results. We apply Leif to real-world web services, and the experiment results are encouraging. FET techniques provide better bug-finding capability and reduce the fuzzing time of Leif by 72% to 86% compared to state-of-the-art fuzzing tools.

In particular, this paper makes the following contributions:


The remainder of the paper is organized as follows. Section 2 analyzes the type collapse problem in detail. Section 3 introduces FET techniques to solve the type collapse problem. Section 4 introduces Leif as a proof-of-concept implementation of FET techniques. Section 5 presents the evaluation of FET techniques and Leif. Section 6 discusses related works and Section 7 concludes.

# **2 Motivation**

It is essential for automated REST testing tools to generate test cases by filling parameters with automatically generated values. This procedure requires adequate information about parameters. Otherwise, the possible candidate space would become enormous even for one single parameter. Therefore, a majority of state-of-the-art automated testing tools focus on reducing the candidate space by sophisticated methodologies. For instance, RESTler [13] arranges multiple APIs in the producer-consumer order, and uses response data gained from the previous APIs to request the next. Chizpurfle [23] and EvoMaster [12] generate optimal candidate values based on evolutionary algorithms.

Nevertheless, previous works have not focused on the root cause of the candidate space explosion. Since most RESTful APIs are designed for exchanging data between programs implemented in different languages (e.g., Java for mobile applications and Python for the service), only a few common *primitive data types* can be used to represent API parameters. For example, Amazon's online shopping web service takes about 2,400 parameters, among which 748 are number-typed (31%) and 1,581 are string-typed (66%) [19]. That is, types, which are supposed to be diversified, now collapse into very limited cases. Consequently, existing automated testing tools encounter a huge candidate space: knowing only that a parameter is string-typed spans a boundless candidate space, from paragraphs of Shakespeare to specific datetime strings. In addition, it is difficult to pick effective values that pass parameter checking, then reach actual business logic, and finally trigger bugs. Figure 1 shows a code sample of a RESTful API (it requires four parameters: string-typed start, string-typed end, number-typed amount, and number-typed interest). In order to generate an effective value for the parameter start that can reach the business logic, a testing tool has to know that it is an ISO8601 datetime string. Unfortunately, since parameters are mainly of primitive data types, this information is usually hard to obtain. Therefore, the testing tool may treat it as an ordinary string and generate arbitrary strings, which are all rejected by the parameter checking and thus are basically useless.

```python
def calculate_monthly_installment():
    try:
        start = parse(request.get("start"), "YYYY-MM-DDTHH:MM:SSZ")
        end = parse(request.get("end"), "YYYY-MM-DDTHH:MM:SSZ")
        amount = float(request.get("amount"))
        interest = float(request.get("interest"))
    except Exception:
        return make_response("Invalid Parameter", 400, "Bad Request")
    # business logic
    ...
```

**Fig. 1.** A Code Sample of a RESTful API (Written in Python).
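
The effect described above can be reproduced with a small stand-alone sketch. It mirrors only the parameter checking of Fig. 1, using `datetime.strptime` from the Python standard library in place of the unspecified `parse` helper; the function name `passes_parameter_check` and the candidate values are assumptions made for illustration.

```python
from datetime import datetime

# strptime rendering of the "YYYY-MM-DDTHH:MM:SSZ" format used in Fig. 1
ISO_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def passes_parameter_check(start, end, amount, interest):
    """Return True if the four values would get past the checks of Fig. 1."""
    try:
        datetime.strptime(start, ISO_FORMAT)
        datetime.strptime(end, ISO_FORMAT)
        float(amount)
        float(interest)
    except (TypeError, ValueError):
        return False
    return True


# Candidate values a fuzzer might generate for the string-typed parameter `start`.
candidates = ["hello", "", "2021-03-27", "2021-03-27T09:00:00Z"]
reached = [
    s for s in candidates
    if passes_parameter_check(s, "2021-04-01T18:00:00Z", "1000", "0.05")
]
print(reached)
```

Only the candidate that follows the expected datetime format gets past the check; all other strings never reach the business logic.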

The type collapse problem is the major obstacle to obtaining adequate parameter information and leads to inefficient automated testing. Therefore, our solution is to provide a fine-grained description method for parameters that exploits both their data types and their value formats. Leveraging such information, we are able to bootstrap and enhance automated testing techniques and improve their efficiency when testing RESTful web services.

# **3 FET Techniques**

To address the type collapse problem, we introduce FET techniques, including the FET (Format-encoded Type), the FET lattice, and the FET inference. A FET models an API parameter by its data type and its value format. A FET lattice hierarchically organizes a set of FETs based on a partial order. We design FET inference algorithms to seek suitable FETs among a FET lattice for parameters, and the inferred results are the critical information for bootstrapping test case generation strategies.

#### **3.1 Type Lattice**

The idea of the FET lattice is inspired by the type lattice [24] for programming languages, widely used in compilation and program analysis [33, 44, 45]. A type lattice is a *complete lattice* defined on ⟨T, ⊑⟩, where T is a set of data types (e.g. long in C/C++) and ⊑ is a partial order representing type convertibility. Every two lattice elements have a unique *least upper bound* and a unique *greatest lower bound*. An element t<sub>j</sub> is said to *cover* another element t<sub>i</sub> if and only if t<sub>i</sub> ⊏ t<sub>j</sub> but there does not exist a t<sub>m</sub> such that t<sub>i</sub> ⊏ t<sub>m</sub> ⊏ t<sub>j</sub>, where t<sub>i</sub> ⊏ t<sub>j</sub> means t<sub>i</sub> ⊑ t<sub>j</sub> and t<sub>i</sub> ≠ t<sub>j</sub>. Type lattices can model class inheritance hierarchies for object-oriented languages. In this context, for any two elements t<sub>i</sub> and t<sub>j</sub>, t<sub>i</sub> ⊑ t<sub>j</sub> holds if and only if t<sub>i</sub> inherits from or equals t<sub>j</sub>. Figure 2 depicts a type lattice for java.util.Collection (each vertex represents a class or an interface, and each directed edge stands for the inheritance relationship).

The type lattice is the cornerstone of type systems for modern programming languages. In static compilation, the type lattice is applied to checking value assignment and type casting for code validity [38]. In dynamic compilation, e.g., JIT (Just-in-time Compilation) [14], it is employed to predict variable types at program points, so as to remove unnecessary type checking. The type lattice is a powerful tool to ensure the correctness and efficiency of programs. However, in the context of REST, API parameters only manifest limited primitive data types due to the type collapse problem, where the type lattice is no longer sufficient.

#### **3.2 FET Lattice**

**Fig. 2.** A Type Lattice for the Java Collections Framework.

A FET lattice is defined on $\langle \Psi \subseteq T \times F, \preceq \rangle$. A FET $\psi \in \Psi$ is defined by $(t_\psi, f_\psi)$, where $t_\psi \in T$ is a *data type*, and $f_\psi \in F$ is a *value format* or, more specifically, a *set* of values. $\preceq$ is a partial order such that for any two FETs $\psi_i$ and $\psi_j$, $\psi_i \preceq \psi_j$ holds if and only if $t_{\psi_i}$ is *type-convertible* to $t_{\psi_j}$ and $f_{\psi_i}$ is a *subset* of $f_{\psi_j}$, denoted by $t_{\psi_i} \preceq t_{\psi_j}$ and $f_{\psi_i} \subseteq f_{\psi_j}$. A FET $\psi_i$ *covered* by $\psi_j$ implies that $\psi_i$ describes parameter features in a finer grain than $\psi_j$. $\psi_\top$ and $\psi_\bot$ are defined as (AnyType, $U$) and (NoType, $\emptyset$), where $U$ is the set containing arbitrary values. Figure 3 depicts an example FET lattice (a FET's name describes its value format, and FETs at the same level are identically colored).

**FET Acceptance for Parameter Values**. Similar to type lattices, FET lattices help to determine FETs for given parameter values. To achieve this, we define that a value $v$ is *accepted* by a FET $\psi$ if and only if $\mathrm{typeof}(v) \preceq t_\psi$ and $v \in f_\psi$, denoted by $\psi \in \mathrm{acceptance}(v)$. Otherwise $v$ is said to be *rejected* by $\psi$, denoted by $\psi \notin \mathrm{acceptance}(v)$. By definition, $\psi_\top$ accepts all values while $\psi_\bot$ accepts none. A value $v$ can be accepted by more than one FET, and the *greatest lower bound* of the accepting FETs describes the value in the finest grain. We call this FET the *minimum acceptance* of $v$. The *predecessors* of the minimum acceptance accept $v$ but describe it in a coarser grain, while the *siblings* reject $v$ but describe other similar values in the same grain. The minimum acceptance, the predecessors, and the siblings of $v$ compose a *tree*, denoted by $\psi\text{-tree}(v)$. For example, for a SHA1 string $v$, its minimum acceptance (the SHA1 FET in Figure 3), the predecessors (Hash, String, and $\psi_\top$), and the siblings (MD5 and SHA256) compose the $\psi\text{-tree}(v)$.
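The acceptance rule and the search for a minimum acceptance can be sketched in a few lines of Python. This is a minimal illustration, not Leif's implementation: the `FET` class, the `minimum_acceptance` helper, and the regular expressions are assumptions introduced here, and the lattice fragment only mirrors the SHA1 example from the text.

```python
import hashlib
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FET:
    """A format-encoded type: a data type plus a value format (here a regex)."""
    name: str
    data_type: type                 # t_psi; `object` stands in for AnyType
    pattern: Optional[str] = None   # f_psi as a regex; None means "any value"
    children: List["FET"] = field(default_factory=list)

    def accepts(self, value) -> bool:
        # typeof(v) must be convertible to t_psi, and v must belong to f_psi.
        if not isinstance(value, self.data_type):
            return False
        return self.pattern is None or re.fullmatch(self.pattern, str(value)) is not None


def minimum_acceptance(top: FET, value) -> FET:
    """Descend from the top FET while some covered FET still accepts the value."""
    current = top
    while True:
        nxt = next((c for c in current.children if c.accepts(value)), None)
        if nxt is None:
            return current
        current = nxt


# Hypothetical lattice fragment; only the hex-digest lengths are standard facts.
hash_fet = FET("Hash", str, r"[0-9a-f]+", [
    FET("MD5", str, r"[0-9a-f]{32}"),
    FET("SHA1", str, r"[0-9a-f]{40}"),
    FET("SHA256", str, r"[0-9a-f]{64}"),
])
top = FET("Top", object, None, [FET("String", str, None, [hash_fet])])

sha1_value = hashlib.sha1(b"abc").hexdigest()
print(minimum_acceptance(top, sha1_value).name)  # SHA1
print(minimum_acceptance(top, "hello").name)     # String
```

A value that no child accepts stays at the coarsest FET that still accepts it, which is why the plain string above ends at String rather than Hash.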

**Avoiding the Ambiguity of FET Lattices**. As seen in Figure 3, if a single value is accepted by two sibling FETs (e.g. MD5 and SHA1), the minimum acceptance falls into the trivial $\psi_\bot$. Generally, a FET lattice is said to be *ambiguous* if there exist two FETs with the *same predecessor* that can both accept the *same value*. To avoid ambiguity, a validation procedure is obligatory after a FET lattice is constructed, which ensures that the value formats of every two sibling FETs with the same data type are always disjoint.
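As a rough illustration of what an ambiguity looks like, the sketch below tests sibling formats against a list of sample values. It is only a sampling check under assumed names (`ambiguous_pairs`, the two sibling dictionaries): it can expose an overlap but cannot prove disjointness, which requires an automata-based procedure.

```python
import re
from itertools import combinations


def ambiguous_pairs(siblings, samples):
    """Return (fet_a, fet_b, value) triples where two sibling formats accept the same sample."""
    hits = []
    for (name_a, pat_a), (name_b, pat_b) in combinations(siblings.items(), 2):
        for value in samples:
            if re.fullmatch(pat_a, value) and re.fullmatch(pat_b, value):
                hits.append((name_a, name_b, value))
    return hits


# Hypothetical sibling formats: the second MD5 pattern is deliberately too loose.
disjoint = {"MD5": r"[0-9a-f]{32}", "SHA1": r"[0-9a-f]{40}"}
overlapping = {"MD5": r"[0-9a-f]{32,}", "SHA1": r"[0-9a-f]{40}"}
samples = ["0" * 32, "0" * 40]

print(ambiguous_pairs(disjoint, samples))     # []
print(ambiguous_pairs(overlapping, samples))  # one hit for the 40-character sample
```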

**Fig. 3.** An Example FET Lattice.

In practice, we specify value formats with regular expressions, and provide a ubiquitous FET lattice [20] to model the most common RESTful parameters. We elaborate on FET lattice construction and verification in Section 4.2.

#### **3.3 FET Inference**

**Tree-merging FET Inference**. As discussed previously, for a single value $v$, a unique $\psi\text{-tree}(v)$ can always be found in an unambiguous FET lattice. In practice, a RESTful API parameter usually involves multiple values. Hence we give the *tree-merging FET inference*. For a parameter with values $v_1, \cdots, v_n$, the tree-merging inference computes $\psi\text{-tree}(v_1), \cdots, \psi\text{-tree}(v_n)$ and then merges them into one tree. The merged tree is denoted by $\psi\text{-tree}_n(V_n)$, where $V_n = \{v_1, \cdots, v_n\}$. The tree-merging inference can be described as a "find-expand-merge" procedure: (1) find the minimum acceptance for a single value $v_i$ by performing a depth-first search from $\psi_\top$, adding the predecessors along the search path to the tree; (2) expand the tree by adding the siblings, which yields $\psi\text{-tree}(v_i)$; (3) repeat steps (1) and (2) for every value and merge all the trees. Steps (1) and (2) are illustrated in Figure 4, and step (3) can be reduced to DNS tree merging [25]. Assuming that the FET lattice has $l$ levels with $m$ FETs, the time complexity is $O(m)$ for computing one tree and $O(l)$ for merging two trees. Thus the time complexity of tree-merging FET inference for a parameter involving $n$ values is $O(n \cdot (m + l))$.
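The find-expand-merge procedure can be sketched with plain Python sets. The lattice fragment, the regular expressions, and the function names below are assumptions made for illustration; the sketch also assumes every FET has a single parent, so merging reduces to a set union.

```python
import re

# Hypothetical lattice fragment: FET -> (parent, data type, regex for the value format).
LATTICE = {
    "Top":    (None,     object,       r".*"),
    "String": ("Top",    str,          r".*"),
    "Number": ("Top",    (int, float), r".*"),
    "Hash":   ("String", str,          r"[0-9a-f]+"),
    "MD5":    ("Hash",   str,          r"[0-9a-f]{32}"),
    "SHA1":   ("Hash",   str,          r"[0-9a-f]{40}"),
    "SHA256": ("Hash",   str,          r"[0-9a-f]{64}"),
}


def children(name):
    return [n for n, (parent, _, _) in LATTICE.items() if parent == name]


def accepts(name, value):
    _, data_type, pattern = LATTICE[name]
    return isinstance(value, data_type) and re.fullmatch(pattern, str(value), re.S) is not None


def psi_tree(value):
    # (1) find: walk down from Top, recording the predecessors on the way.
    tree, current = set(), "Top"
    while True:
        tree.add(current)
        nxt = next((c for c in children(current) if accepts(c, value)), None)
        if nxt is None:
            break
        current = nxt
    # (2) expand: add the siblings of the minimum acceptance.
    parent = LATTICE[current][0]
    if parent is not None:
        tree.update(children(parent))
    return tree


def infer(values):
    # (3) merge: the union of the per-value trees.
    merged = set()
    for value in values:
        merged |= psi_tree(value)
    return merged


print(sorted(psi_tree("a" * 40)))
print(sorted(infer(["a" * 40, 7])))
```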

**Bitfield-boosting FET Inference**. In practice, we notice that the number of FETs $m$ in a lattice is a constant while the number of values $n$ is a variable (usually over 1,000). Therefore, we optimize the tree-merging FET inference based on three observations: (1) each FET can be uniquely represented by one bit in an $m$-bit bitfield, and therefore ψ-trees can be represented by several bits in such bitfields; (2) given a minimum acceptance, its ψ-tree can be uniquely determined, so the ψ-tree for every FET can be computed before inference; (3) merging two ψ-trees is equivalent to performing a bitwise OR operation on their corresponding bitfields.

**Fig. 4.** Inferring $\psi\text{-tree}(v_i)$ for a Single Value $v_i$.

Hence, we give the *forward computation algorithm* and the *bitfield-boosting FET inference*. The forward computation traverses the lattice in breadth-first order, assigns a unique bitfield ID per FET, and computes the ψ-tree of every FET, as shown in Algorithm 1. Building on the forward computation, the bitfield-boosting inference only needs to find the minimum acceptance by depth-first search, look up its bitfield tree, and merge it into $\psi\text{-tree}_{i-1}(V_{i-1})$, as shown in Algorithm 2. Therefore, $\psi\text{-tree}_n(V_n)$ can be efficiently computed by a series of bitwise OR operations instead of graph computations, reducing the time complexity from $O(n \cdot (m + l))$ to $O(n \cdot m)$.
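The two algorithms can be approximated in Python as follows. This is a sketch under stated assumptions rather than Leif's code: the lattice fragment and regular expressions are hypothetical, the bottom element is omitted, and each FET is assumed to have exactly one parent.

```python
import re
from collections import deque

# Hypothetical lattice fragment (bottom omitted): FET -> FETs it covers.
CHILDREN = {
    "Top": ["String", "Number"],
    "String": ["Hash"],
    "Hash": ["MD5", "SHA1", "SHA256"],
}
FORMAT = {  # FET -> (data type, regex for the value format)
    "Top": (object, r".*"), "String": (str, r".*"), "Number": ((int, float), r".*"),
    "Hash": (str, r"[0-9a-f]+"), "MD5": (str, r"[0-9a-f]{32}"),
    "SHA1": (str, r"[0-9a-f]{40}"), "SHA256": (str, r"[0-9a-f]{64}"),
}


def forward_computation(top="Top"):
    """Assign one bit per FET in breadth-first order, then precompute each FET's tree."""
    ids, next_id, queue = {}, 1, deque([top])
    while queue:
        current = queue.popleft()
        ids[current] = next_id
        next_id <<= 1
        queue.extend(CHILDREN.get(current, []))
    p_tree, tree = {top: 0}, {top: ids[top]}
    queue = deque([top])
    while queue:
        current = queue.popleft()
        kids = CHILDREN.get(current, [])
        s_tree = 0
        for kid in kids:                      # bits of all siblings under `current`
            s_tree |= ids[kid]
        for kid in kids:
            p_tree[kid] = p_tree[current] | ids[current]   # bits of all predecessors
            tree[kid] = p_tree[kid] | s_tree
            queue.append(kid)
    return ids, tree


def accepts(name, value):
    data_type, pattern = FORMAT[name]
    return isinstance(value, data_type) and re.fullmatch(pattern, str(value), re.S) is not None


def infer(values, tree, top="Top"):
    """Find each value's minimum acceptance and OR its precomputed tree into the result."""
    merged = 0
    for value in values:
        current = top
        while True:
            nxt = next((c for c in CHILDREN.get(current, []) if accepts(c, value)), None)
            if nxt is None:
                break
            current = nxt
        merged |= tree[current]
    return merged


ids, tree = forward_computation()
bits = infer(["a" * 40, 7], tree)
print(sorted(name for name, bit in ids.items() if bits & bit))
```

Because the per-FET trees are computed once, each additional value costs one descent through the lattice plus a single bitwise OR.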

### **4 FET-enhanced REST Fuzzing**

To demonstrate the utility of FET techniques, we design Leif, a FET-enhanced REST fuzzing tool, and implement it as a command-line tool in 2,796 lines of Python code. This section elaborates on the workflow of Leif, along with methodologies for collecting HTTP traffic (Section 4.1), for constructing FET lattices (Section 4.2), and for interfacing FET techniques with fuzzers (Section 4.3).

Figure 5 depicts Leif's workflow and its interaction with existing systems and tools. Leif assumes that the web service under test is already deployed on a staging server or in a production environment. The developer acquires the Leif program with a built-in FET lattice and traces HTTP traffic between the service and the clients. Then Leif identifies RESTful APIs by parsing the captured traffic and performs FET inference on parameter values. The inferred results are used to bootstrap test case generation. Finally, Leif emits test cases and observes erroneous behaviors of the service.


#### **Algorithm 1:** The Forward Computation Algorithm.

    Input: A FET Lattice.
    ID ← 1;
    queue ← Queue(ψ⊤);
    while !queue.isEmpty() do
        current ← queue.pop();
        current.ID ← ID;
        ID ← ID << 1;
        foreach ψ covered by current AND ψ ≠ ψ⊥ do
            queue.push(ψ);
    ψ⊤.pTree ← 0;
    ψ⊤.sTree ← ψ⊤.ID;
    ψ⊤.tree ← ψ⊤.pTree ∨ ψ⊤.sTree;
    queue ← Queue(ψ⊤);
    while !queue.isEmpty() do
        current ← queue.pop();
        sTree ← 0;
        foreach ψ covered by current AND ψ ≠ ψ⊥ do
            sTree ← sTree ∨ ψ.ID;
        foreach ψ covered by current AND ψ ≠ ψ⊥ do
            ψ.pTree ← current.pTree ∨ current.ID;
            ψ.sTree ← sTree;
            ψ.tree ← ψ.pTree ∨ ψ.sTree;
            queue.push(ψ);

#### **4.1 Collecting and Parsing HTTP Traffic**

As introduced in Section 3.3, the inferred result of a parameter is derived from its different values, and therefore the accuracy of FET inference increases as Leif witnesses more values. Developers are thus expected to apply suitable tracing methods. For example, monkey testing and scripted regression testing are preferable to unit testing for collecting traffic. Leif takes HAR files (an archival format for HTTP traffic [39]) as input, which are the standard output of network proxies (Fiddler, MitmProxy [22], etc.) and browser inspectors (e.g. Chrome and Safari). To identify parameters, the payload (including the headers, the query string, and the body) of a captured request is parsed into key-value pairs in JSON format. Due to the type collapse problem, only four data types are present: boolean, number, string, and object (including array). Non-object-typed parameters are directly provided to FET inference, while object-typed parameters are flattened. Since a JSON object is a tree of properties, Leif flattens it by splitting leaf properties into independent non-object-typed parameters and assigning new keys named by their JSONPaths [29], as illustrated in Figure 6. The flattened parameters are then also provided to FET inference. Finally, FET inference receives parameters for each API, where each parameter has a unique key and usually multiple values.
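Object flattening can be illustrated with a short recursive function. The `flatten` helper and the exact key syntax (`$.a.b`, `$.a[0]`) are assumptions for this sketch; the precise JSONPath form Leif emits is not specified here.

```python
def flatten(value, path="$"):
    """Split a parsed JSON value into leaf parameters keyed by a JSONPath-style string."""
    if isinstance(value, dict):
        flat = {}
        for key, child in value.items():
            flat.update(flatten(child, f"{path}.{key}"))
        return flat
    if isinstance(value, list):
        flat = {}
        for index, child in enumerate(value):
            flat.update(flatten(child, f"{path}[{index}]"))
        return flat
    # Leaf: a boolean, number, string, or null becomes one independent parameter.
    return {path: value}


payload = {"user": {"id": 7, "tags": ["a", "b"]}, "ok": True}
for key, leaf in flatten(payload).items():
    print(key, repr(leaf))
```

In this sketch, empty objects and arrays contribute no parameters, since they have no leaf values to infer from.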

#### **Algorithm 2:** The Bitfield-boosting FET Inference.

```
Input: Parameter Values Vn = {v1, ··· , vn}.
Output: ψ-tree_n(Vn).
ψ-tree_0(V0) ← 0;
for i ← 1 to n do
    current ← ψ⊤;
    accepted ← true;
    while accepted do
        accepted ← false;
        foreach ψ covered by current do
            if ψ ∈ acceptance(vi) then
                current ← ψ;
                accepted ← true;
    ψ-tree_i(Vi) ← ψ-tree_{i−1}(V_{i−1}) ∨ current.tree;
return ψ-tree_n(Vn);
```
#### **4.2 Ubiquitous FET Lattice**

**Regular Expressions for Value Formats**. In Leif's built-in ubiquitous FET lattice, value formats are specified by regular expressions. We choose regular expressions rather than creating a new language to define value formats because they have several advantages in this scenario. Firstly, regular expressions are the de facto descriptions of most string formats. Although regular languages are less expressive than context-free languages, they can still distinguish different value formats. Secondly, they are already familiar to developers, and therefore they are easy to construct without extra learning costs. Finally, ensuring the unambiguity of a FET lattice amounts to ensuring that the regular expressions of sibling FETs are orthogonal (i.e., no value matches both), which can be formally decided with finite automata [46].

**FET Lattice Constructing and Updating**. We construct the ubiquitous FET lattice by referencing popular RESTful services (e.g. Google Maps, AWS, Twitter, and GitHub): (1) we crawl API documents from these services and identify potential FETs used in them; (2) we construct regular expressions for these FETs by referencing related RFCs (e.g. RFC3339 [35] for ISO8601, and RFC3986 [16] for URI), programming language specifications (e.g. the Java specification [34] for PackageName), and database schema definitions (e.g. the MongoDB data type definition [21] for Hash) to build a base FET lattice; (3) we apply the Bayesian regular expression generation technique [42] to discover new FETs from traffic and merge them into the base lattice; (4) we verify the unambiguity by checking the orthogonality of regular expressions for sibling FETs, using the dk.brics.automaton library [37]. The verified lattice has 21 FETs organized in 5 levels, and we believe it is sufficient to model most RESTful services. If a developer has application-specific FETs (at the first usage or when major service updates take place), the developer can update the lattice by adding FETs via step (3) and repeating step (4) for unambiguity verification.

**Fig. 5.** The Workflow Architecture of Leif.

(a) The Original Parameter. (b) The Tree Structure. (c) The Flattening Result.

**Fig. 6.** An Example of Object Flattening.

**Twinning FET Inference**. We notice that some parameters can be represented in multiple data types and are minimally accepted by distinct FETs in different data types. For example, an epoch datetime (elapsed seconds or milliseconds since 1970-01-01 00:00:00) is accepted by the EpochString FET when it is represented as a string, but by the Integer FET when it is represented as a number. Applying type casting to such parameters is therefore worthwhile during testing. To support this feature, we implement the *twinning FET inference*. Before a value is inferred, Leif generates its twinning value if possible. If the original value is number-typed, Leif generates a twinning string-typed value (e.g. 1589809244481 → "1589809244481") and vice versa ("1589809244481" → 1589809244481). Then both values are inferred, and the resulting two ψ-trees are merged as if Leif had witnessed two independent values. By doing so, both the Datetime and the Integer FETs are included in the final $\psi\text{-tree}_n$ of an epoch datetime parameter.
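Generating a twinning value is a small type-casting step. The `twin` function below is a hypothetical sketch of that step; the strict patterns for numeric strings are an assumption chosen to avoid Python's more permissive `int()` and `float()` parsing.

```python
import re


def twin(value):
    """Return the twinning value in the other data type, or None if there is none."""
    if isinstance(value, bool):
        return None                       # booleans are not treated as numbers here
    if isinstance(value, (int, float)):
        return str(value)                 # number -> string
    if isinstance(value, str):
        if re.fullmatch(r"-?[0-9]+", value):
            return int(value)             # integer-looking string -> number
        if re.fullmatch(r"-?[0-9]+\.[0-9]+", value):
            return float(value)           # decimal-looking string -> number
    return None


print(repr(twin(1589809244481)))    # '1589809244481'
print(repr(twin("1589809244481")))  # 1589809244481
print(repr(twin("abc")))            # None
```

Both the original and the twin would then be passed to inference as if they were two observed values of the same parameter.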

#### **4.3 FET-aware Trace-driven Fuzzing**

Trace-driven fuzzing tools generate test cases by replacing parameter values of captured requests with candidate values. Therefore the success of a fuzzer mainly depends on the quality of its candidate values. In conventional tools, using a larger candidate dictionary is the basic strategy to increase the opportunity for triggering bugs, yet it lengthens the fuzzing time.

In contrast, Leif provides a small but targeted dictionary for each FET. We give several examples (corresponding to Figure 3): Number is tried with integer overflows (8-bit, 16-bit, 32-bit, and 64-bit overflows) with signed and unsigned values; Datetime is tried with year overflows (year 2038 and year 10,000), invalid dates (e.g. 2019-2-29), and timezone tweaks; ISO8601 is tried with meta characters ("-", ":", etc.) omitted; URI is tried with malformed URLs (e.g. doubled "/", stripped "protocol://", and unescaped characters). With each parameter tagged by a $\psi\text{-tree}_n$, Leif generates test cases by exhausting the dictionaries of all the FETs in the tree. Notice that, as discussed in Section 3.2, the predecessors and the siblings of the minimum acceptance describe similar but usually invalid values. Therefore, candidates from these FETs are the values most likely to pass parameter checking and trigger bugs. For an API with multiple parameters, Leif exhausts the dictionaries for one parameter at a time and tests the API by iterating this exhaustion over its parameters. In this way, Leif increases the opportunity to trigger bugs while saving fuzzing time.
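The one-parameter-at-a-time exhaustion can be sketched as a generator. Everything named here is illustrative: the `DICTIONARIES` contents, the `generate_test_cases` function, and the request and ψ-tree shapes are assumptions, not Leif's actual dictionaries or data model.

```python
# Hypothetical per-FET candidate dictionaries, loosely following the examples in the text.
DICTIONARIES = {
    "Number": [2**31, 2**32, 2**63, 2**64],          # 32-bit and 64-bit overflows
    "Datetime": [
        "2038-01-19T03:14:08Z",                      # one second past the signed 32-bit epoch limit
        "10000-01-01T00:00:00Z",                     # year 10,000
        "2019-02-29T00:00:00Z",                      # invalid date
    ],
}


def generate_test_cases(request, psi_trees):
    """Mutate one parameter at a time, exhausting the dictionaries of the FETs in its tree."""
    for key, fets in psi_trees.items():
        for fet in fets:
            for candidate in DICTIONARIES.get(fet, []):
                mutated = dict(request)   # copy so the captured request stays intact
                mutated[key] = candidate
                yield mutated


request = {"since": "2020-01-01T00:00:00Z", "page": 1}
psi_trees = {"since": ["Datetime"], "page": ["Number"]}
cases = list(generate_test_cases(request, psi_trees))
print(len(cases))  # 7
```

Mutating a single parameter per test case keeps the total number of cases linear in the dictionary sizes rather than multiplicative across parameters.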

# **5 Evaluation**

In this section, we evaluate Leif with real-world RESTful web services, and the complete dataset of our evaluation is publicly available [19]. Specifically, we design three experiments to answer the following research questions:


#### **5.1 FET Inference Accuracy Evaluation**

In this experiment, we assume that API documents provided by the service developers are the *ground truth*, and we validate the accuracy of FET inference by comparing the inferred results with the ground truth. We choose GitHub<sup>3</sup> and Twitter<sup>4</sup>, and we randomly pick 50 RESTful APIs (25 from each). We extract two pieces of information from the document text: (1) parameter data types, as explicitly listed in the documents; (2) parameter value formats, as provided in the detailed descriptions (e.g. "This [the parameter since] is a timestamp in ISO8601 format."<sup>5</sup>). We feed example requests obtained from the documents to FET inference, compare the inferred FETs with the ground truth, and observe three levels of matching:


Intuitively, an exact match precisely describes a parameter such that a fuzzer can exploit it to generate the most targeted values. A partial match is benign, for it includes values that will not appear in practice, and a fuzzer may generate a small set of useless values based on a partial match. A mismatch indicates that the value format is not yet supported by the current FET lattice.

**Fig. 7.** FET Inference Accuracy Evaluation Results.

Figure 7(a) exhibits the ratios of matching on GitHub (137 parameters), Twitter (86 parameters) and the weighted average (223 parameters). In total, 149 (67%) inferred results are exact matches, and 71 (32%) are partial matches.

<sup>3</sup> https://docs.github.com/en/free-pro-team@latest/rest/reference

<sup>4</sup> https://developer.twitter.com/en/docs

<sup>5</sup> https://docs.github.com/en/free-pro-team@latest/rest/reference/gists

We observe 3 mismatches in two cases: one is a binary-array parameter for file uploading and the other is an array of key-value pairs (e.g. [["key1", "value1"], ["key2", "value2"], ...]). Binary arrays can be supported by adding a FET ([01]\* for the value format) to the current lattice, but Leif aims to detect logic-related bugs, while binaries are usually logic-free but content-sensitive [43]. Therefore Leif simply does not mutate them. As for key-value pairs, they are actually two-dimensional arrays where the first dimension is immutable since it indicates the actual parameter key. To support such cases, we are considering allowing developers to specify which special parameters are immutable in a future version of Leif. For the partial matches, we review the documents, and the top cases are application-specific formats such as comma-separated strings and PGP signatures. These formats are less common, and developers can add application-specific FETs to their lattices by following the steps introduced in Section 4.2. Figure 7(b) exhibits the breakdown of exact matches (the inner ring is the distribution of the primitive data types and the outer ring is the inferred FETs) to quantify how FET inference improves parameter information. The coarse-grained number-typed (27%) and string-typed (61%) parameters are divided into much smaller slices (5% ∼ 14%). The breakdown shows that FET inference classifies parameters in a balanced way, and therefore restores the collapsed types. This enables a fuzzer to generate more targeted values, which shrinks the candidate space and increases the opportunity to find bugs.

#### **5.2 Leif Effectiveness Evaluation**

In this experiment, we select 27 popular mobile applications to evaluate the effectiveness of Leif. Each of them is backed by a commercial RESTful web service serving millions or even billions of users. We monkey-test [30] each application for 20 minutes, capture HTTP traffic, and run the full-stack Leif workflow. Table 1 lists the subjects; the services have an average of 133 RESTful APIs with over 19 parameters per API. We collect 46 requests per API on average, which yields adequate request samples for inference. Leif reports 5XX HTTP responses as bugs along with the corresponding traffic. We have reached out to the service owners, reported these bugs, and validated them through analysis of traffic (through API URLs, parameter key-value pairs, and response data) and analysis of the involved applications (through reverse engineering and static code analysis of APKs) to eliminate any false-positive or duplicated cases. Table 2 summarizes the 11 distinct bugs found by Leif. The testing process is fully automated, which mimics how developers would use Leif as a black-box fuzzing tool in practice, and our following analysis mimics how to classify bugs and locate related code lines based on Leif's testing results.

**Security Bugs with Information Leakage**. Bugs 1, 2, and 10 are security bugs with information leakage problems. They can be reproduced by mutating the parameter appVer (VersionTag), the parameter platform (Identifier), and the parameter c.v (Integer). These bugs not only cause service crashes but also expose sensitive information to end users (potential attackers). With the exposed information, attackers can easily design specialized attacks. For example, the response data of bug 10 contains the full Java exception stack trace without any obfuscation. From the stack trace, attackers can learn that the service uses an outdated Spring Framework<sup>6</sup> version, which suffers from numerous security vulnerabilities [5,6,8–11]. By exploiting CVE-2020-5421 and CVE-2020-5398 [10, 11], attackers can initiate reflected file download attacks [31] to mislead users into downloading malware. By exploiting CVE-2018-1257 [5], attackers can expose STOMP over WebSocket and then initiate denial of service attacks [17]. They can also learn that the service uses the com.alibaba.fastjson library<sup>7</sup> to deserialize user inputs. Attackers can therefore launch remote code execution attacks by exploiting known defects in that specific library version [7, 32].

**Table 1.** Experiment Subjects of the Effectiveness Validation.

<sup>a</sup> The statistic is from Tencent AppStore (https://sj.qq.com) up to Jan. 9th, 2020.

In such cases, we suggest that developers first avoid information leakage problems by checking the service data flow, ensuring that the results of sensitive methods (e.g., java.lang.Exception.toString) cannot be output to end users, and then diagnose security problems by analyzing server logs. Besides, they should stay alert to public vulnerability reports and upgrade their codebases in a timely manner.

<sup>6</sup> Spring Framework, https://spring.io/projects/spring-framework

<sup>7</sup> Fastjson, https://github.com/alibaba/fastjson

**Table 2.** Bugs Found by Leif during the Effectiveness Validation.

<sup>a</sup> Bug 3 and bug 4 involve the same API but with different HTTP status codes.

<sup>b</sup> Bug 5 and bug 11 involve the same API but different applications.

<sup>c</sup> Bug 6 and bug 7 involve the same API path but different domain names.

**Third-party API Bugs**. We notice that 6 of the bugs involve APIs provided by third parties. Bugs 3 and 4 involve the API for user authorization provided by Sina Weibo, a social networking platform serving over half a billion users. We decompile the Sina News APK and locate the related code lines. We find that the application uses a deprecated version of the API. When this API fails, an unhandled exception is propagated and causes the application to crash. It can be reproduced by injecting the meta characters "/.:/" into the parameter packagename (PackageName) and into the parameter mfp (Hash). Bugs 6 and 7 involve the API provided by a customer service platform. The application also relies on the deprecated API and crashes when the API fails. Bugs 5 and 11 are detected in different applications but involve the same API provided by Baidu. These two bugs can be reproduced by mutating the parameter SdkVer (VersionTag).

Using third-party APIs is very common, but they are often overlooked during testing. However, bugs in third-party code are as important as those in the application's own code, because both mean application functionality failure for billions of end users. Our results show that Leif can find bugs that reach into third-party APIs. We suggest that developers capture application traffic and apply Leif to test untrusted third-party APIs. In addition, they should design proper exception handling logic for third-party code and upgrade in a timely manner to the latest API versions with known bugs fixed.

**Bugs with Limited Information**. We obtain very limited information from bug 8 and 9, because their responses solely contain HTTP status codes. These bugs could be as critical as the security bugs since they involve a private API and cause the service to crash. Therefore service developers can debug such APIs by following the analysis methods for the security bugs as mentioned.

#### **5.3 Comparative Evaluation**

**Leif vs. Trace-driven Fuzzers**. We classify Leif as a trace-driven fuzzer and we now compare it with state-of-the-art trace-driven fuzzing tools. We select BurpSuite [2], a commercial security testing fuzzer for RESTful web services, and Fuzzapi [3], an open-source general-purpose HTTP fuzzer. They provide built-in candidate dictionaries but require a series of manual configurations, including filling the URL for each API and the data type for each parameter. Therefore we only apply them to Sina News, Toutiao, and Amazon Shopping (518 unique APIs with 15,512 parameters in total). In addition, we implement NaiveFuzzer as a baseline that only understands primitive data types and randomly mutates parameter values solely based on such coarse-grained information. We construct NaiveFuzzer's candidate dictionaries by combining the dictionaries of BurpSuite and Fuzzapi.

We evaluate the bug-finding capabilities of BurpSuite, Fuzzapi, Leif, and NaiveFuzzer by comparing the number of bugs found by each tool, as reported in Figure 8(a). We evaluate their fuzzing time by comparing the average number of test cases generated per parameter, as exhibited in Figure 8(b). Fewer generated test cases mean less test execution time and thus more efficient fuzzing. Considering that the subjects were already well tested before release, we believe the bug-finding capability of Leif is better than that of BurpSuite and Fuzzapi, because Leif finds extra bugs. NaiveFuzzer has the same capability as BurpSuite and Fuzzapi because they share the same candidate space. As for fuzzing time, BurpSuite, Fuzzapi, and NaiveFuzzer respectively generate 5.0× ∼ 6.7×, 3.6× ∼ 4.7×, and 6.3× ∼ 7.1× as many test cases as Leif, indicating that FET techniques bring a 72% ∼ 86% fuzzing time reduction.
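One way to read the reported range, assuming that fuzzing time is proportional to the number of generated test cases, is that a tool generating $r$ times as many test cases as Leif implies a relative saving of $1 - 1/r$. Taking the smallest and largest ratios above:

$$
1 - \frac{1}{3.6} \approx 0.72, \qquad 1 - \frac{1}{7.1} \approx 0.86
$$

This reproduces the 72% ∼ 86% bounds; the proportionality assumption ignores per-request differences in execution time.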

**Leif vs. Specification-driven Fuzzers**. We now compare Leif with existing specification-driven fuzzers, which test RESTful web services based on parsing API specifications. We select RESTler [13], a state-of-the-art research fuzzer, and TnT-Fuzzer [4], an open-source robustness testing tool. They both require OpenAPI specifications [40] as input, but most of the subject services do not provide OpenAPI specifications. Therefore we construct OpenAPI specifications for Sina News, Toutiao, and Amazon Shopping by parsing HTTP traffic and referencing their official API documents.

We intended to run RESTler, but unfortunately neither the executable program nor the source code is available. According to the paper, RESTler only supports primitive data types and uses a plain candidate dictionary (consisting of 0, 1, "", and "sampleString"). Yet none of the bugs found by Leif can be triggered by these values, indicating that running RESTler would fail to detect any of the bugs. TnT-Fuzzer generates candidate values simply based on the Python random() function (i.e. purely random fuzzing). We configure it to generate 1,000 test cases per parameter (about 5× that of NaiveFuzzer and 30× that of Leif). Still, TnT-Fuzzer fails to find any bugs in the three services. We conclude that the two fuzzers' effectiveness is limited by the practical difficulty of finding well-written OpenAPI specifications and by the quality of their candidates. These are also the main shortcomings of all specification-driven fuzzers. Besides, many modern APIs require short-lived session tokens for access control or throttling. Specification-driven fuzzers require manual configuration or even repeated reconfiguration for such parameters. In contrast, trace-driven fuzzers can easily meet this requirement by mutating freshly captured requests.

**Fig. 8.** Bug-finding Capabilities and Fuzzing Time of the Evaluated Fuzzers.

### **6 Related Work**

**Model-driven Testing**. Model-driven testing [15, 26, 27, 47, 48] is usually white-box and requires using a specific modeling method (e.g. UML or DSL) throughout the whole development lifecycle, which is human-intensive and technically limited for services that span multiple servers and micro-services from different vendors. Essentially, FET techniques are also model-driven (i.e. driven by the lattice model) but only intervene in the test phase. Thus FET techniques can be practically employed to test diversified RESTful web services in a black-box manner.

**Trace-driven Fuzzing**. Trace-driven fuzzing generates test cases by mutating recorded requests. Fuzzapi [3], BurpSuite [2], AppSpider [1] and Leif all fall into this category. Existing trace-driven fuzzers mainly focus on improving the ability to capture and replay HTTP traffic. However, Leif demonstrates that FET techniques provide fundamental parameter information to fuzzers, bringing the enhanced bug-finding capability and significant fuzzing time reduction.

**Specification-driven Fuzzing**. Another main class of fuzz testing techniques is specification-driven fuzzing, such as TnT-Fuzzer [4], EvoMaster [12], and RESTler [13], which avoids the type collapse problem by assuming that developers provide well-defined specifications with detailed parameter information. However, OpenAPI [40] is the only well-established standard so far, yet it is not widely used. A survey [41] reveals that 71% of developers lack knowledge of the OpenAPI framework. Therefore, specification-driven fuzzing is still too idealistic for testing real-world RESTful web services. In comparison, instead of asking developers for good specifications, FET techniques generate fine-grained specifications (i.e. the $\psi\text{-tree}_n$ of each parameter) on their own.

**Security Penetration Testing**. Fuzz testing techniques are also commonly used for security penetration testing. Commercial security penetration tools, such as BurpSuite [2], use values of SQL injections, unescaped HTML characters, XML/JSON external entities, etc., to expose system vulnerabilities. FET techniques can also be employed in security penetration testing, as demonstrated in Section 5.2. Our main goal, however, is not limited to security testing for RESTful web services, because FET techniques improve the value selection strategy for general-purpose REST fuzzing.

#### **7 Conclusion and Future Work**

In this paper, we analyze the type collapse problem and propose FET techniques to remedy this problem. As a proof-of-concept, we design and implement Leif, a FET-enhanced trace-driven fuzzing tool. We demonstrate that using FET techniques greatly improves a fuzzer's understanding of parameters, resulting in more effective fuzz testing. Our experiment results show that Leif unveils 11 new bugs in application-specific web services as well as general third-party open API platforms with 72% ∼ 86% fuzzing time reduction.

FET techniques are capable of effectively bootstrapping automated testing tools. We believe they are also helpful for parameter validity checking because these two technical problems are isomorphic in a sense. Thus we are beginning to study how to automatically generate or enhance parameter checking code based on FET techniques for RESTful web services.

#### **Acknowledgments**

We would like to thank the anonymous reviewers for their valuable comments. This work was supported in part by National Key Research Development Program of China (No. 2016YFB1000502), National NSF of China (No. 61672344, 61525204, and 61732010), Shanghai Pujiang Program (No. 19PJ1430900), and Shanghai Key Laboratory of Scalable Computing and Systems.

# **References**



# **A Decision Tree Lifted Domain for Analyzing Program Families with Numerical Features**

Aleksandar S. Dimovski <sup>1</sup>, Sven Apel <sup>2</sup>, and Axel Legay <sup>3</sup>

<sup>1</sup> Mother Teresa University, 12 Udarna Brigada 2a, 1000 Skopje, North Macedonia aleksandar.dimovski@unt.edu.mk

<sup>2</sup> Saarland University, Saarland Informatics Campus, E1.1, 66123 Saarbrücken, Germany

<sup>3</sup> Université catholique de Louvain, 1348 Ottignies-Louvain-la-Neuve, Belgium

**Abstract.** *Lifted* (*family-based*) *static analysis* by abstract interpretation is capable of analyzing all variants of a program family simultaneously, in a single run without generating any of the variants explicitly. The elements of the underlying lifted analysis domain are tuples, which maintain one property per variant. Still, explicit property enumeration in tuples, one by one for all variants, immediately yields combinatorial explosion. This is particularly apparent in the case of program families that, apart from Boolean features, contain also numerical features with large domains, thus giving rise to astronomical configuration spaces.

The key for an efficient lifted analysis is a proper handling of variability-specific constructs of the language (e.g., feature-based runtime tests and #if directives). In this work, we introduce a new symbolic representation of the lifted abstract domain that can efficiently analyze program families with numerical features. This makes sharing between property elements corresponding to different variants explicitly possible. The elements of the new lifted domain are constraint-based *decision trees*, where decision nodes are labeled with linear constraints defined over numerical features and the leaf nodes belong to an existing single-program analysis domain. To illustrate the potential of this representation, we have implemented an experimental lifted static analyzer, called SPLNum2Analyzer, for inferring invariants of C programs. An empirical evaluation on BusyBox and on benchmarks from SV-COMP yields promising preliminary results indicating that our decision trees-based approach is effective and outperforms the baseline tuple-based approach.

#### **1 Introduction**

Many software systems today are configurable [6]: they use *features* (or configurable options) to control the presence and absence of functionality. Different family members, called variants, are derived by switching features on and off, while the reuse of common code is maximized, leading to productivity gains, shorter time to market, greater market coverage, etc. Program families (e.g., software product lines) are commonly seen in the development of commercial embedded software, such as cars, phones, avionics, medicine, robotics, etc. Configurable options (features) are used to support different application scenarios for embedded components, to provide portability across different hardware platforms and configurations, or to produce variations of products for different market segments or customers. We consider here program families implemented using #if directives from the C preprocessor CPP [20]. They use #if-s to specify under which conditions parts of code should be included in or excluded from a variant. Classical program families use only Boolean features that have two values: on and off. However, Boolean features are insufficient for real-world program families, as there exist features that have a range of numbers as possible values. These features are called *numerical features* [25]. For instance, the Linux kernel, BusyBox, the Apache web server, and the Java Garbage Collector are real-world program families with numerical features. Analyzing such program families is very challenging, because a huge number of variants can be derived from only a few features.

© The Author(s) 2021. E. Guerra and M. Stoelinga (Eds.): FASE 2021, LNCS 12649, pp. 67–86, 2021.

https://doi.org/10.1007/978-3-030-71500-7_4

In this paper, we are concerned with the verification of program families with Boolean and numerical features using abstract interpretation-based static analysis. *Abstract interpretation* [7,24] is a general theory for approximating the semantics of programs. It provides sound (all confirmative answers are correct) and efficient (with a good trade-off between precision and cost) static analyses of run-time properties of real programs. It has been used as the foundation for various successful industrial-scale static analyzers, such as ASTREE [8]. Still, the static analysis of program families is harder than the static analysis of single programs, because the number of possible variants can be very large (often huge) in practice. The simplest brute-force approach that uses a preprocessor to generate all variants of a family, and then applies an existing off-the-shelf single-program analyzer to each individual variant, one-by-one, is very inefficient [3,27]. Therefore, we use so-called *lifted* (family-based) *static analyses* [3,22,27], which analyze all variants of the family simultaneously without generating any of the variants explicitly. They take as input the common code base, which encodes all variants of a program family, and produce precise analysis results corresponding to all variants. They use a lifted analysis domain, which represents an n-fold product of an existing single-program analysis domain used for expressing program properties (where n is the number of valid configurations). That is, the lifted analysis domain maintains one property element per valid variant in tuples. The problem is that this explicit property enumeration in tuples becomes computationally intractable with larger program families because the number of variants (i.e., configurations) grows exponentially with the number of features. 
This problem has been successfully addressed for program families that contain only Boolean features [1,2,11], by using sharing through binary decision diagrams (BDDs). However, the fundamental limitation of existing lifted analysis techniques is that they are not able to handle numerical features.

To overcome this limitation, we present a *new, refined lifted abstract domain for effectively analyzing program families with numerical features by means of abstract interpretation*. The elements of the lifted abstract domain are constraint-based *decision trees*, where the decision nodes are labelled with linear constraints over numerical features, whereas the leaf nodes belong to a single-program analysis domain. The decision trees recursively partition the space of configurations (i.e., the space of possible combinations of feature values), whereas the program properties at the leaves provide analysis information corresponding to each partition, i.e. to the variants (configurations) that satisfy the constraints along the path to the given leaf node. The partitioning is dynamic, which means that partitions are split by feature-based tests (at #if directives), and joined when merging the corresponding control flows again. In terms of decision trees, this means that new decision nodes are added by feature-based tests and removed when merging control flows. In fact, the partitioning of the set of configurations is semantics-based, which means that linear constraints over numerical features that occur in decision nodes are automatically inferred by the analysis and do not necessarily occur syntactically in the code base.

Our lifted abstract domain is parametric in the choice of the numerical property domain [7,24] that underlies the linear constraints over numerical features labelling decision nodes, and in the choice of the single-program analysis domain for leaf nodes. In fact, in our implementation, we also use numerical property domains for leaf nodes, which encode linear constraints over program variables. We rely on well-known numerical domains, such as intervals [7], octagons [23], and polyhedra [10], from the APRON library [19] to obtain a concrete decision tree-based implementation of the lifted abstract domain. This way, we have implemented a *forward reachability analysis* of C program families with numerical (and Boolean) features for the automatic inference of invariants. Our tool, called SPLNum2Analyzer<sup>4</sup>, computes a set of possible invariants, which represent linear constraints over program variables. We can use the implemented lifted static analyzer to check invariance properties of C program families, such as assertions, buffer overflows, null pointer references, division by zero, etc. [8].

In summary, we make several contributions: (1) We propose a new, parameterized lifted analysis domain based on decision trees for analyzing program families with numerical features; (2) We implement a prototype lifted static analyzer, SPLNum2Analyzer, that performs a forward analysis of #if-enriched C programs, where numerical property domains from the APRON library are used as parameters in the lifted analysis domain; (3) We evaluate our approach for automatic inference of invariants by comparing performances of lifted analyzers based on tuples and decision trees.

#### **2 Motivating Example**

To illustrate the potential of a decision tree-based lifted domain, we consider a motivating example using the code base of the following program family SIMPLE:

<sup>4</sup> Num<sup>2</sup> in the name of the tool refers to its ability to both handle Numerical features and to perform Numerical client analysis of SPLs (program families).

```
1 int x := 10, y := 0;
2 while (x != 0) {
3   x := x-1;
4   #if (SIZE ≤ 3) y := y+1; #else y := y-1; #endif
5   #if (!B) y := 0; #else skip; #endif
6 }
7 assert (y > 1);
```
The set F of features is {B, SIZE}, where B is a Boolean feature and SIZE is a numerical feature whose domain is [1, 4] = {1, 2, 3, 4}. Thus, the set of valid configurations is K = {B ∧ (SIZE = 1), B ∧ (SIZE = 2), B ∧ (SIZE = 3), B ∧ (SIZE = 4), ¬B ∧ (SIZE = 1), ¬B ∧ (SIZE = 2), ¬B ∧ (SIZE = 3), ¬B ∧ (SIZE = 4)}. The code of SIMPLE contains two #if directives, which change the value assigned to y, depending on how features from F are set at compile-time. For each configuration from K, a different variant (single program) can be generated by appropriately resolving #if-s. For example, the variant corresponding to configuration B ∧ (SIZE = 1) will have B and SIZE set to true and 1, so that the assignment y := y+1 and skip in program locations 4 and 5, respectively, will be included in this variant. The variant for configuration ¬B ∧ (SIZE = 4) will have features B and SIZE set to false and 4, so the assignments y := y-1 and y := 0 in program locations 4 and 5, respectively, will be included in this variant. There are |K| = 8 variants that can be derived from the family SIMPLE.

Assume that we want to perform *lifted polyhedra analysis* of SIMPLE using the *Polyhedra* numerical domain [10]. The standard lifted analysis domain used in the literature [3,22] is defined as the Cartesian product of |K| copies of the basic analysis domain (e.g. polyhedra). Hence, elements of the lifted domain are tuples containing one component for each valid configuration from K, where each component represents a polyhedral linear constraint over program variables (x and y in this case). The lifted analysis result in location 7 of SIMPLE is an 8-sized tuple shown in Fig. 1. Note that the first component of the tuple in Fig. 1 corresponds to configuration B ∧ (SIZE = 1), the second to B ∧ (SIZE = 2), the third to B ∧ (SIZE = 3), and so on. We can see in Fig. 1 that the polyhedra analysis discovers very precise results for the variable y: (y = 10) for configurations B ∧ (SIZE = 1), B ∧ (SIZE = 2), and B ∧ (SIZE = 3); (y = −10) for configuration B ∧ (SIZE = 4); and (y = 0) for all other configurations. This is due to the fact that the polyhedra domain is fully relational and is able to track all relations between program variables x and y. Using this result in location 7, we can successfully conclude that the assertion is valid for configurations B ∧ (SIZE = 1), B ∧ (SIZE = 2), and B ∧ (SIZE = 3), whereas the assertion fails for all other configurations.
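The per-configuration values of y quoted above can be cross-checked by brute force. The following is a minimal sketch in Python, not the lifted analysis itself: it simply executes each of the eight variants concretely. The function name `run_variant` and the encoding of configurations as `(B, SIZE)` pairs are assumptions made for this illustration.

```python
from itertools import product

def run_variant(b: bool, size: int) -> int:
    """Concretely execute the variant of SIMPLE selected by (B=b, SIZE=size)."""
    x, y = 10, 0
    while x != 0:
        x -= 1
        # location 4: #if (SIZE <= 3) y := y+1; #else y := y-1; #endif
        y = y + 1 if size <= 3 else y - 1
        # location 5: #if (!B) y := 0; #else skip; #endif
        if not b:
            y = 0
    return y

# Enumerate K in the same order as the tuple components described in the text.
configs = list(product([True, False], [1, 2, 3, 4]))
final_y = {(b, size): run_variant(b, size) for b, size in configs}

# The assertion at location 7 is y > 1.
assertion_holds = {k: y > 1 for k, y in final_y.items()}
```

This is exactly the brute-force strategy that lifted analysis is meant to avoid; it is shown only to make the expected per-variant results tangible.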

If we perform lifted polyhedra analysis based on the *decision tree domain* proposed in this work, then the corresponding decision tree inferred in the final program location 7 of SIMPLE is depicted in Fig. 2. Notice that the inner nodes of the decision tree in Fig. 2 are labeled with *Interval* linear constraints over features (SIZE and B), while the leaves are labeled with the *Polyhedra* linear constraints over program variables x and y. Hence, we use two different numerical abstract domains in our decision trees: Interval domain [7] for expressing properties in decision nodes, and Polyhedra domain [10] for expressing properties in leaf nodes. The edges of decision trees are labeled with the truth value of the decision on the parent node; we use solid edges for true (i.e. the constraint in the parent node is satisfied) and dashed edges for false (i.e. the negation of the constraint in the parent node is satisfied). As decision nodes partition the space of valid configurations K, we implicitly assume the correctness of linear constraints that take into account domains of numerical features. For example, the node with constraint (SIZE ≤ 3) is satisfied when (SIZE ≤ 3) ∧ (1 ≤ SIZE ≤ 4), whereas its negation is satisfied when (SIZE > 3) ∧ (1 ≤ SIZE ≤ 4). The constraints (1 ≤ SIZE ≤ 4) represent the domain [1, 4] of SIZE. We can see that decision trees offer more possibilities for sharing and interaction between analysis properties corresponding to different configurations; they provide a symbolic and compact representation of lifted analysis elements. For example, Fig. 2 presents polyhedra properties of two program variables x and y, which are partitioned with respect to features B and SIZE. When (B ∧ (SIZE ≤ 3)) is true the shared property is (y = 10, x = 0), whereas when (B ∧ ¬(SIZE ≤ 3)) is true the shared property is (y = −10, x = 0). When ¬B is true, the property is independent of the value of SIZE, hence a node with a constraint over SIZE is not needed. Therefore, all such cases are identical and so they share the same leaf node (y = 0, x = 0). In effect, the decision tree-based representation uses only three leaves, whereas the tuple-based representation uses eight properties. This ability for sharing is the key motivation behind the decision trees-based representation.

Fig. 1: Tuple-based invariant at location 7 of SIMPLE.

Fig. 2: Decision tree-based invariant at location 7 of SIMPLE (solid edges = true, dashed edges = false).
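The sharing argument can be made concrete with a small data structure. The sketch below, in Python, encodes the tree as the text describes it (root B, a SIZE ≤ 3 node on the true branch, one shared leaf on the false branch) and expands it back into the eight-component tuple. The class names `Leaf` and `Node`, the helper names, and the use of strings for leaf properties are assumptions of this illustration, not part of SPLNum2Analyzer.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Union

Config = Dict[str, int]  # B is encoded as 0/1, SIZE as an integer in [1, 4]

@dataclass(frozen=True)
class Leaf:
    prop: str  # stands in for a polyhedral property over x and y

@dataclass(frozen=True)
class Node:
    label: str
    test: Callable[[Config], bool]
    true_branch: "Tree"   # solid edge
    false_branch: "Tree"  # dashed edge

Tree = Union[Leaf, Node]

def lookup(tree: Tree, k: Config) -> str:
    """Follow the decisions satisfied by configuration k down to a leaf."""
    while isinstance(tree, Node):
        tree = tree.true_branch if tree.test(k) else tree.false_branch
    return tree.prop

def leaves(tree: Tree) -> List[Leaf]:
    if isinstance(tree, Leaf):
        return [tree]
    return leaves(tree.true_branch) + leaves(tree.false_branch)

# The invariant at location 7, shaped as described in the text.
fig2 = Node("B", lambda k: k["B"] == 1,
            Node("SIZE <= 3", lambda k: k["SIZE"] <= 3,
                 Leaf("y = 10, x = 0"),
                 Leaf("y = -10, x = 0")),
            Leaf("y = 0, x = 0"))

configs = [{"B": b, "SIZE": s} for b in (1, 0) for s in (1, 2, 3, 4)]
as_tuple = [lookup(fig2, k) for k in configs]  # the tuple-based view
```

Expanding the tree yields eight components, but only three distinct leaves are stored, which is the compaction the paragraph above refers to.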

#### **3 A Language for Program Families**

Let $\mathbb{F} = \{A_1, \ldots, A_k\}$ be a finite and totally ordered set of *numerical features* available in a program family. For each feature $A \in \mathbb{F}$, $\mathrm{dom}(A) \subseteq \mathbb{Z}$ denotes the set of possible values that can be assigned to $A$. Note that any Boolean feature can be represented as a numerical feature $B \in \mathbb{F}$ with $\mathrm{dom}(B) = \{0, 1\}$, such that 0 means that feature $B$ is disabled while 1 means that $B$ is enabled. A valid combination of feature values represents a *configuration* $k$, which specifies one *variant* of a program family. It is given as a *valuation function* $k : \mathbb{F} \to \mathbb{Z}$, which is a mapping that assigns a value from $\mathrm{dom}(A)$ to each feature $A$, i.e. $k(A) \in \mathrm{dom}(A)$ for any $A \in \mathbb{F}$. We assume that only a subset $\mathbb{K}$ of all possible configurations is *valid*. An alternative representation of configurations is based upon propositional formulae. Each configuration $k \in \mathbb{K}$ can be represented by a formula: $(A_1 = k(A_1)) \land \ldots \land (A_k = k(A_k))$. We often abbreviate $(B = 1)$ with $B$ and $(B = 0)$ with $\neg B$, for a Boolean feature $B \in \mathbb{F}$. The set of valid configurations $\mathbb{K}$ can also be represented as a formula: $\bigvee_{k \in \mathbb{K}} k$.

We define *feature expressions*, denoted *FeatExp*(F), as the set of propositional logic formulas over constraints of F generated by the grammar:

$$\theta ::= \text{true} \mid e_{\mathbb{F}_{\mathbb{Z}}} \bowtie e_{\mathbb{F}_{\mathbb{Z}}} \mid \neg \theta \mid \theta_1 \land \theta_2 \mid \theta_1 \lor \theta_2, \qquad e_{\mathbb{F}_{\mathbb{Z}}} ::= n \mid A \mid e_{\mathbb{F}_{\mathbb{Z}}} \oplus e_{\mathbb{F}_{\mathbb{Z}}}$$

where $A \in \mathbb{F}$, $n \in \mathbb{Z}$, $\oplus \in \{+, -, *\}$, and $\bowtie\ \in \{=, <\}$. We will use $\theta \in \mathit{FeatExp}(\mathbb{F})$ to write presence conditions. When a configuration $k \in \mathbb{K}$ satisfies a feature expression $\theta \in \mathit{FeatExp}(\mathbb{F})$, we write $k \models \theta$, where $\models$ is the standard satisfaction relation of logic. We write $[\![\theta]\!]$ to denote the set of configurations from $\mathbb{K}$ that satisfy $\theta$, that is, $k \in [\![\theta]\!]$ iff $k \models \theta$.

*Example 1.* For the SIMPLE program family from Section 2, the set of features is $\mathbb{F}$ = {B, SIZE} where dom(SIZE) = [1, 4], and the set of configurations is $\mathbb{K}$ = {B ∧ (SIZE = 1), B ∧ (SIZE = 2), B ∧ (SIZE = 3), B ∧ (SIZE = 4), ¬B ∧ (SIZE = 1), ¬B ∧ (SIZE = 2), ¬B ∧ (SIZE = 3), ¬B ∧ (SIZE = 4)}. For the feature expression (SIZE ≤ 3), we have [[(SIZE ≤ 3)]] = {B ∧ (SIZE = 1), B ∧ (SIZE = 2), B ∧ (SIZE = 3), ¬B ∧ (SIZE = 1), ¬B ∧ (SIZE = 2), ¬B ∧ (SIZE = 3)}. Hence, B ∧ (SIZE = 2) ⊨ (SIZE ≤ 3) and B ∧ (SIZE = 4) ⊭ (SIZE ≤ 3), where B ∧ (SIZE = 2) ∈ $\mathbb{K}$, B ∧ (SIZE = 4) ∈ $\mathbb{K}$, and (SIZE ≤ 3) ∈ *FeatExp*($\mathbb{F}$).
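Example 1 can be replayed with a small evaluator for feature expressions. This is a hedged sketch in Python: the nested-tuple encoding of θ, the names `sat` and `denote`, and the choice to treat every combination of domain values as valid are assumptions of the illustration. Since the grammar only offers `=` and `<`, the expression SIZE ≤ 3 is written as SIZE < 4, which is equivalent over the integers.

```python
from itertools import product

DOMAINS = {"B": (0, 1), "SIZE": (1, 2, 3, 4)}
FEATURES = tuple(DOMAINS)

# Every combination of domain values is assumed valid here, so |K| = 8.
K = [dict(zip(FEATURES, values)) for values in product(*DOMAINS.values())]

def eval_exp(e, k):
    """Evaluate an arithmetic feature expression: n | A | (op, e1, e2)."""
    if isinstance(e, int):
        return e
    if isinstance(e, str):
        return k[e]
    op, left, right = e
    a, b = eval_exp(left, k), eval_exp(right, k)
    return {"+": a + b, "-": a - b, "*": a * b}[op]

def sat(theta, k):
    """Decide k |= theta for theta built from true, =, <, not, and, or."""
    if theta == "true":
        return True
    tag = theta[0]
    if tag == "not":
        return not sat(theta[1], k)
    if tag == "and":
        return sat(theta[1], k) and sat(theta[2], k)
    if tag == "or":
        return sat(theta[1], k) or sat(theta[2], k)
    if tag == "=":
        return eval_exp(theta[1], k) == eval_exp(theta[2], k)
    if tag == "<":
        return eval_exp(theta[1], k) < eval_exp(theta[2], k)
    raise ValueError(f"unknown connective: {tag}")

def denote(theta):
    """[[theta]]: the configurations of K that satisfy theta."""
    return [k for k in K if sat(theta, k)]

size_le_3 = ("<", "SIZE", 4)  # SIZE <= 3, expressed with the available '<'
```

With this encoding, `denote(size_le_3)` returns the six configurations listed in Example 1.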

We consider a simple sequential non-deterministic programming language, which will be used to exemplify our work. The program variables *Var* are statically allocated and the only data type is the set $\mathbb{Z}$ of mathematical integers. To encode multiple variants, a new compile-time conditional statement is included. The new statement "#if (θ) s #endif" contains a feature expression $\theta \in \mathit{FeatExp}(\mathbb{F})$ as a presence condition, such that the statement s is included in the variant corresponding to a configuration $k \in \mathbb{K}$ only if θ is satisfied by k. The syntax is:

$$s ::= \mathtt{skip} \mid \mathtt{x} := e \mid s; s \mid \mathtt{if}\ (e)\ \mathtt{then}\ s\ \mathtt{else}\ s \mid \mathtt{while}\ (e)\ \mathtt{do}\ s \mid \mathtt{\#if}\ (\theta)\ s\ \mathtt{\#endif}, \qquad e ::= n \mid [n, n'] \mid \mathtt{x} \mid e \oplus e$$

where n ranges over integers, [n, n′] over integer intervals, x over program variables *Var*, and ⊕ over binary arithmetic operators. Integer intervals [n, n′] denote a random choice of an integer in the interval. The set of all statements s is denoted by *Stm*; the set of all expressions e is denoted by *Exp*.

A program family is evaluated in two stages. First, the C *preprocessor* CPP takes a program family $s$ and a configuration $k \in \mathbb{K}$ as inputs, and produces a variant (without #if-s) corresponding to $k$ as the output. Second, the obtained variant is evaluated using the standard single-program semantics. The first stage is specified by the projection function $P_k$, which is an identity for all basic statements and recursively pre-processes all sub-statements of compound statements. Hence, $P_k(\mathtt{skip}) = \mathtt{skip}$ and $P_k(s; s') = P_k(s); P_k(s')$. The interesting case is "#if (θ) s #endif", where statement $s$ is included in the variant if $k \models \theta$, otherwise, $s$ is removed <sup>5</sup>:

$$P_k(\mathtt{\#if}\ (\theta)\ s\ \mathtt{\#endif}) = \begin{cases} P_k(s) & \text{if } k \models \theta \\ \mathtt{skip} & \text{if } k \not\models \theta \end{cases}$$

For example, variants $P_{B \land (\mathrm{SIZE}=1)}(\mathrm{SIMPLE})$, $P_{B \land (\mathrm{SIZE}=4)}(\mathrm{SIMPLE})$, $P_{\neg B \land (\mathrm{SIZE}=1)}(\mathrm{SIMPLE})$, as well as $P_{\neg B \land (\mathrm{SIZE}=4)}(\mathrm{SIMPLE})$ shown in Fig. 3a, Fig. 3b, Fig. 3c, and Fig. 3d, respectively, are derived from the SIMPLE family defined in Section 2.

Fig. 3: Different variants of the program family SIMPLE from Section 2.
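The projection function can be sketched as a recursive traversal of a statement tree. In the Python illustration below, statements are nested tuples, presence conditions are plain predicates over a configuration, and `#else` is encoded as a second `#if` guarded by the negated condition; all three are assumptions of this sketch, since the formal language only provides "#if (θ) s #endif".

```python
def project(s, k):
    """P_k: resolve every #if in statement s for configuration k."""
    tag = s[0]
    if tag in ("skip", "assign"):
        return s  # identity on basic statements
    if tag == "seq":
        return ("seq", project(s[1], k), project(s[2], k))
    if tag == "if":
        return ("if", s[1], project(s[2], k), project(s[3], k))
    if tag == "while":
        return ("while", s[1], project(s[2], k))
    if tag == "ifdef":
        theta, body = s[1], s[2]
        # keep the guarded statement only when k satisfies theta
        return project(body, k) if theta(k) else ("skip",)
    raise ValueError(f"unknown statement: {tag}")

def size_le_3(k):
    return k["SIZE"] <= 3

# Location 4 of SIMPLE, with #else written as an #if on the negated condition.
loc4 = ("seq",
        ("ifdef", size_le_3, ("assign", "y", "y+1")),
        ("ifdef", lambda k: not size_le_3(k), ("assign", "y", "y-1")))

variant_size1 = project(loc4, {"B": 1, "SIZE": 1})
variant_size4 = project(loc4, {"B": 1, "SIZE": 4})
```

For SIZE = 1 the increment survives and the decrement becomes `skip`; for SIZE = 4 the roles are swapped.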

### **4 Lifted Analysis based on Tuples**

Lifted analyses are designed by *lifting* existing single-program analyses to work on program families, rather than on individual programs. They directly analyze program families. Lifted analyses as defined by Midtgaard et al. [22] rely on a lifted domain that is the $|\mathbb{K}|$-fold product of an existing single-program analysis domain $\mathbb{A}$ defined over program variables *Var*. We assume that the domain $\mathbb{A}$ is equipped with sound operators for concretization $\gamma_{\mathbb{A}}$, ordering $\sqsubseteq_{\mathbb{A}}$, join $\sqcup_{\mathbb{A}}$, meet $\sqcap_{\mathbb{A}}$, bottom $\bot_{\mathbb{A}}$, top $\top_{\mathbb{A}}$, widening $\nabla_{\mathbb{A}}$, and narrowing $\triangle_{\mathbb{A}}$, as well as sound transfer functions for tests $\mathrm{FILTER}_{\mathbb{A}}$ and forward assignments $\mathrm{ASSIGN}_{\mathbb{A}}$. More specifically, $\mathrm{FILTER}_{\mathbb{A}}(a : \mathbb{A}, e : \mathit{Exp})$ returns an abstract element from $\mathbb{A}$ obtained by restricting $a$ to satisfy the test $e$, whereas $\mathrm{ASSIGN}_{\mathbb{A}}(a : \mathbb{A}, \mathtt{x} := e : \mathit{Stm})$ returns an updated version of $a$ by abstractly evaluating $\mathtt{x} := e$ in it.

*Lifted Domain.* The *lifted analysis domain* is defined as $\langle \mathbb{A}^{\mathbb{K}}, \dot{\sqsubseteq}, \dot{\sqcup}, \dot{\sqcap}, \dot{\bot}, \dot{\top} \rangle$, where $\mathbb{A}^{\mathbb{K}}$ is shorthand for the $|\mathbb{K}|$-fold product $\prod_{k \in \mathbb{K}} \mathbb{A}$, that is, there is one separate copy of $\mathbb{A}$ for each configuration of $\mathbb{K}$. For example, consider the tuple in Fig. 1.

*Lifted Abstract Operations.* Given a tuple (lifted domain element) $\overline{a} \in \mathbb{A}^{\mathbb{K}}$, the projection $\pi_k$ selects the $k$th component of $\overline{a}$. All abstract lifted operations are defined by lifting the abstract operations of the domain $\mathbb{A}$ configuration-wise.

$$\begin{array}{l} \overline{\gamma}(\overline{a}) = \prod_{k \in \mathbb{K}} \gamma_{\mathbb{A}}(\pi_k(\overline{a})), \qquad \overline{a}_1 \mathrel{\dot{\sqsubseteq}} \overline{a}_2 \equiv \pi_k(\overline{a}_1) \sqsubseteq_{\mathbb{A}} \pi_k(\overline{a}_2) \text{ for } \forall k \in \mathbb{K} \\ \overline{a}_1 \mathbin{\dot{\sqcup}} \overline{a}_2 = \prod_{k \in \mathbb{K}} (\pi_k(\overline{a}_1) \sqcup_{\mathbb{A}} \pi_k(\overline{a}_2)), \qquad \overline{a}_1 \mathbin{\dot{\sqcap}} \overline{a}_2 = \prod_{k \in \mathbb{K}} (\pi_k(\overline{a}_1) \sqcap_{\mathbb{A}} \pi_k(\overline{a}_2)) \\ \dot{\top} = \prod_{k \in \mathbb{K}} \top_{\mathbb{A}} = (\top_{\mathbb{A}}, \ldots, \top_{\mathbb{A}}), \qquad \dot{\bot} = \prod_{k \in \mathbb{K}} \bot_{\mathbb{A}} = (\bot_{\mathbb{A}}, \ldots, \bot_{\mathbb{A}}) \\ \overline{a}_1 \mathbin{\dot{\nabla}} \overline{a}_2 = \prod_{k \in \mathbb{K}} (\pi_k(\overline{a}_1) \nabla_{\mathbb{A}} \pi_k(\overline{a}_2)), \qquad \overline{a}_1 \mathbin{\dot{\triangle}} \overline{a}_2 = \prod_{k \in \mathbb{K}} (\pi_k(\overline{a}_1) \triangle_{\mathbb{A}} \pi_k(\overline{a}_2)) \end{array}$$
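Configuration-wise lifting is mechanical, as the following Python sketch suggests. It assumes a deliberately tiny leaf domain (a single integer interval per configuration, with `None` standing for bottom) and represents a tuple as a dictionary keyed by configuration name; the names `join_itv`, `leq_itv`, `lift_binary`, `lifted_join`, and `lifted_leq` are hypothetical.

```python
from typing import Callable, Dict, Optional, Tuple

Interval = Optional[Tuple[int, int]]  # None plays the role of bottom
Lifted = Dict[str, Interval]          # one component per configuration

def join_itv(a: Interval, b: Interval) -> Interval:
    """Join of the (assumed) leaf domain: smallest interval covering both."""
    if a is None:
        return b
    if b is None:
        return a
    return (min(a[0], b[0]), max(a[1], b[1]))

def leq_itv(a: Interval, b: Interval) -> bool:
    """Ordering of the leaf domain: interval inclusion."""
    if a is None:
        return True
    if b is None:
        return False
    return b[0] <= a[0] and a[1] <= b[1]

def lift_binary(op: Callable[[Interval, Interval], Interval]):
    """Lift a binary operator of the leaf domain to tuples, component by component."""
    return lambda a1, a2: {k: op(a1[k], a2[k]) for k in a1}

lifted_join = lift_binary(join_itv)

def lifted_leq(a1: Lifted, a2: Lifted) -> bool:
    """The lifted ordering holds iff it holds for every configuration."""
    return all(leq_itv(a1[k], a2[k]) for k in a1)

a1 = {"k1": (0, 0), "k2": None}
a2 = {"k1": (5, 5), "k2": (1, 2)}
joined = lifted_join(a1, a2)
```

The cost of every lifted operation is proportional to the number of configurations, which is what motivates the sharing introduced in Section 5.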

<sup>5</sup> Since $k \in \mathbb{K}$ is a valuation function, either $k \models \theta$ holds or $k \not\models \theta$ holds for any $\theta$.

*Lifted Transfer Functions.* We now define lifted transfer functions for tests, forward assignments (ASSIGN), and #if-s (IFDEF). There are two types of tests: *expression-based tests*, denoted FILTER, that occur in while-s and if-s, and *feature-based tests*, denoted FEAT-FILTER, that occur in #if-s. Each lifted transfer function takes as input a tuple from $\mathbb{A}^{\mathbb{K}}$ representing the invariant before evaluating the statement (resp., expression) to handle, and returns a tuple representing the invariant after evaluating the given statement (resp., expression).

$$\begin{array}{l} \overline{\text{FILTER}}(\overline{a} : \mathbb{A}^{\mathbb{K}}, e : \mathit{Exp}) = \prod_{k \in \mathbb{K}} \text{FILTER}_{\mathbb{A}}(\pi_k(\overline{a}), e) \\ \overline{\text{FEAT-FILTER}}(\overline{a} : \mathbb{A}^{\mathbb{K}}, \theta : \mathit{FeatExp}(\mathbb{F})) = \prod_{k \in \mathbb{K}} \begin{cases} \pi_k(\overline{a}), & \text{if } k \models \theta \\ \bot_{\mathbb{A}}, & \text{if } k \not\models \theta \end{cases} \\ \overline{\text{ASSIGN}}(\overline{a} : \mathbb{A}^{\mathbb{K}}, \mathtt{x} := e : \mathit{Stm}) = \prod_{k \in \mathbb{K}} \text{ASSIGN}_{\mathbb{A}}(\pi_k(\overline{a}), \mathtt{x} := e) \\ \overline{\text{IFDEF}}(\overline{a} : \mathbb{A}^{\mathbb{K}}, \mathtt{\#if}\ (\theta)\ s : \mathit{Stm}) = \overline{[\![s]\!]}(\overline{\text{FEAT-FILTER}}(\overline{a}, \theta)) \mathbin{\dot{\sqcup}} \overline{\text{FEAT-FILTER}}(\overline{a}, \neg\theta) \end{array}$$

where $\overline{[\![s]\!]}(\overline{a})$ is the lifted transfer function for statement $s$. $\overline{\text{FILTER}}$ and $\overline{\text{ASSIGN}}$ are defined by applying $\text{FILTER}_{\mathbb{A}}$ and $\text{ASSIGN}_{\mathbb{A}}$ independently on each component of the input tuple $\overline{a}$. $\overline{\text{FEAT-FILTER}}$ keeps those components $k$ of the input tuple $\overline{a}$ that satisfy $\theta$, and replaces the other components with $\bot_{\mathbb{A}}$. $\overline{\text{IFDEF}}$ captures the effect of analyzing the statement $s$ in the components $k$ of $\overline{a}$ that satisfy $\theta$, and is an identity for the other components.
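The interplay of FEAT-FILTER and IFDEF can be sketched as follows. This Python illustration assumes an interval for the single variable y as leaf domain (`None` is bottom), configurations encoded as `(B, SIZE)` pairs, presence conditions given as predicates, and a hard-coded transfer function for `y := y+1`; none of these names come from the authors' tool.

```python
from itertools import product

def join(a, b):
    """Interval join of the assumed leaf domain; None is bottom."""
    if a is None:
        return b
    if b is None:
        return a
    return (min(a[0], b[0]), max(a[1], b[1]))

def lifted_join(a1, a2):
    return {k: join(a1[k], a2[k]) for k in a1}

def feat_filter(a, theta):
    """Keep components whose configuration satisfies theta; others become bottom."""
    return {k: (v if theta(k) else None) for k, v in a.items()}

def assign_incr(a):
    """Lifted transfer function for y := y+1, applied component-wise."""
    return {k: None if v is None else (v[0] + 1, v[1] + 1) for k, v in a.items()}

def ifdef(a, theta, transfer):
    """#if (theta) s #endif: analyze s where theta holds, identity elsewhere."""
    then_part = transfer(feat_filter(a, theta))
    else_part = feat_filter(a, lambda k: not theta(k))
    return lifted_join(then_part, else_part)

K = list(product((1, 0), (1, 2, 3, 4)))  # (B, SIZE) pairs
a_in = {k: (0, 0) for k in K}            # y = 0 in every variant
a_out = ifdef(a_in, lambda k: k[1] <= 3, assign_incr)
```

Because bottom is the neutral element of the join, each output component comes from exactly one of the two filtered tuples.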

*Lifted Analysis.* Lifted abstract operators and transfer functions of the lifted analysis domain $\mathbb{A}^{\mathbb{K}}$ are combined together to analyze program families. Initially, we build a tuple $\overline{a}_{in}$ where all components are set to $\top_{\mathbb{A}}$ for the first program location, and tuples where all components are set to $\bot_{\mathbb{A}}$ for all other locations. The analysis properties are propagated forward from the first program location towards the final location taking assignments, #if-s, and tests into account with join and widening around while-s. The *soundness* of the lifted analysis based on $\mathbb{A}^{\mathbb{K}}$ follows immediately from the soundness of all abstract operators and transfer functions of $\mathbb{A}$ (proved in [22]).

*Numerical Lifted Analysis.* The single-program analysis domain $\mathbb{A}$ can be instantiated by some of the well-known numerical property domains [24], such as Intervals $\langle I, \sqsubseteq_I \rangle$ [7], Octagons $\langle O, \sqsubseteq_O \rangle$ [26], and Polyhedra $\langle P, \sqsubseteq_P \rangle$ [10]. The elements of $I$ are intervals of the form $\pm x \geq \beta$, where $x \in \mathit{Var}$, $\beta \in \mathbb{Z}$; the elements of $O$ are conjunctions of octagonal constraints of the form $\pm x_1 \pm x_2 \geq \beta$, where $x_1, x_2 \in \mathit{Var}$, $\beta \in \mathbb{Z}$; while the elements of $P$ are conjunctions of polyhedral constraints of the form $\alpha_1 x_1 + \ldots + \alpha_k x_k + \beta \geq 0$, where $x_1, \ldots, x_k \in \mathit{Var}$, $\alpha_1, \ldots, \alpha_k, \beta \in \mathbb{Z}$.

# **5 Lifted Analysis based on Decision Trees**

We now introduce a new *decision tree* lifted domain. Its elements are disjunctions of leaf nodes that belong to an existing single-program domain $\mathbb{A}$ defined over program variables *Var*. The leaf nodes are separated by linear constraints over numerical features, organized in the decision nodes. Hence, we encapsulate the set of configurations $\mathbb{K}$ into the decision nodes of a decision tree where each top-down path represents one or several configurations that satisfy the constraints encountered along the given path. We store in each leaf node the property generated from the variants representing the corresponding configurations.

*Abstract domain for decision nodes.* We define the family of abstract domains for linear constraints $\mathbb{C}_{\mathbb{D}}$, which are parameterized by any of the numerical property domains $\mathbb{D}$ (intervals $I$, octagons $O$, polyhedra $P$). We use $\mathbb{C}_I = \{\pm A_i \geq \beta \mid A_i \in \mathbb{F}, \beta \in \mathbb{Z}\}$ to denote the set of *interval constraints*, $\mathbb{C}_O = \{\pm A_i \pm A_j \geq \beta \mid A_i, A_j \in \mathbb{F}, \beta \in \mathbb{Z}\}$ to denote the set of *octagonal constraints*, and $\mathbb{C}_P = \{\alpha_1 A_1 + \ldots + \alpha_k A_k + \beta \geq 0 \mid A_1, \ldots, A_k \in \mathbb{F}, \alpha_1, \ldots, \alpha_k, \beta \in \mathbb{Z}, \gcd(|\alpha_1|, \ldots, |\alpha_k|, |\beta|) = 1\}$ to denote the set of *polyhedral constraints*. We have $\mathbb{C}_I \subseteq \mathbb{C}_O \subseteq \mathbb{C}_P$.

The set $\mathbb{C}_{\mathbb{D}}$ of linear constraints over features $\mathbb{F}$ is constructed by the underlying numerical property domain $\langle \mathbb{D}, \sqsubseteq_{\mathbb{D}} \rangle$ using the Galois connection $\langle \mathcal{P}(\mathbb{C}_{\mathbb{D}}), \sqsubseteq_{\mathbb{D}} \rangle \underset{\alpha_{\mathbb{C}_{\mathbb{D}}}}{\overset{\gamma_{\mathbb{C}_{\mathbb{D}}}}{\rightleftarrows}} \langle \mathbb{D}, \sqsubseteq_{\mathbb{D}} \rangle$, where $\mathcal{P}(\mathbb{C}_{\mathbb{D}})$ is the power set of $\mathbb{C}_{\mathbb{D}}$. The abstraction function $\alpha_{\mathbb{C}_{\mathbb{D}}} : \mathcal{P}(\mathbb{C}_{\mathbb{D}}) \to \mathbb{D}$ maps a set of interval (resp., octagonal, polyhedral) constraints to an interval (resp., an octagon, a polyhedron) that represents a conjunction of constraints; the concretization function $\gamma_{\mathbb{C}_{\mathbb{D}}} : \mathbb{D} \to \mathcal{P}(\mathbb{C}_{\mathbb{D}})$ maps an interval (resp., an octagon, a polyhedron) that represents a conjunction of constraints to a set of interval (resp., octagonal, polyhedral) constraints. We have $\gamma_{\mathbb{C}_{\mathbb{D}}}(\top_{\mathbb{D}}) = \emptyset$ and $\gamma_{\mathbb{C}_{\mathbb{D}}}(\bot_{\mathbb{D}}) = \{\bot_{\mathbb{C}_{\mathbb{D}}}\}$, where $\bot_{\mathbb{C}_{\mathbb{D}}}$ is an unsatisfiable constraint.

The domain of decision nodes is $\mathbb{C}_{\mathbb{D}}$. We assume $\mathbb{F} = \{A_1, \ldots, A_k\}$ to be a finite and totally ordered set of features, such that the ordering is $A_1 > A_2 > \ldots > A_k$. We impose a total order $<_{\mathbb{C}_{\mathbb{D}}}$ on $\mathbb{C}_{\mathbb{D}}$ to be the lexicographic order on the coefficients $\alpha_1, \ldots, \alpha_k$ and constant $\alpha_{k+1}$ of the linear constraints, such that:

$$\begin{array}{ll} \left(\alpha_1 \cdot A_1 + \ldots + \alpha_k \cdot A_k + \alpha_{k+1} \ge 0\right) & <_{\mathbb{C}_{\mathbb{D}}} \left(\alpha'_1 \cdot A_1 + \ldots + \alpha'_k \cdot A_k + \alpha'_{k+1} \ge 0\right) \\ \iff \exists j > 0.\ \forall i < j.\ (\alpha_i = \alpha'_i) \land (\alpha_j < \alpha'_j) \end{array}$$

The negation of linear constraints is formed as: $\neg(\alpha_1 A_1 + \ldots + \alpha_k A_k + \beta \geq 0) = -\alpha_1 A_1 - \ldots - \alpha_k A_k - \beta - 1 \geq 0$. For example, the negation of $A - 3 \geq 0$ is the constraint $-A + 2 \geq 0$ (i.e., $A \leq 2$). To ensure canonical representation of decision trees, a linear constraint $c$ and its negation $\neg c$ cannot both appear as nodes in a decision tree. For example, we only keep the largest constraint with respect to $<_{\mathbb{C}_{\mathbb{D}}}$ between $c$ and $\neg c$. For this reason, we define the equivalence relation $\equiv_{\mathbb{C}_{\mathbb{D}}}$ as $c \equiv_{\mathbb{C}_{\mathbb{D}}} \neg c$. We define $\langle \mathbb{C}_{\mathbb{D}}, <_{\mathbb{C}_{\mathbb{D}}} \rangle$ to denote $\langle \mathbb{C}_{\mathbb{D}} / \equiv, <_{\mathbb{C}_{\mathbb{D}}} \rangle$, such that elements of $\mathbb{C}_{\mathbb{D}}$ are constraints obtained by quotienting by the equivalence $\equiv_{\mathbb{C}_{\mathbb{D}}}$.
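Negation and canonical choice of a representative are easy to express once a constraint is stored as its coefficient vector. In the Python sketch below, a constraint α₁A₁ + ... + α_kA_k + β ≥ 0 is the tuple (α₁, ..., α_k, β); this encoding and the names `negate` and `canonical` are assumptions of the illustration. Python's built-in comparison of equal-length tuples is lexicographic, so it matches the order defined above.

```python
from typing import Tuple

Constraint = Tuple[int, ...]  # (alpha_1, ..., alpha_k, beta)

def negate(c: Constraint) -> Constraint:
    """Integer negation: not(e >= 0) is (-e - 1 >= 0)."""
    *alphas, beta = c
    return tuple(-a for a in alphas) + (-beta - 1,)

def canonical(c: Constraint) -> Tuple[Constraint, bool]:
    """Pick the larger of c and its negation as representative.

    The flag reports whether the representative is the negation of c,
    in which case the true and false branches of a node must be swapped.
    """
    n = negate(c)
    return (c, False) if c > n else (n, True)

a_ge_3 = (1, -3)           # A - 3 >= 0
a_le_2 = negate(a_ge_3)    # -A + 2 >= 0, i.e. A <= 2
```

Negation is an involution on integer constraints, so `negate(negate(c)) == c`, and both members of a pair map to the same representative.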

*Abstract domain for constraint-based decision trees.* A *constraint-based decision tree* $t \in \mathbb{T}(\mathbb{C}_{\mathbb{D}}, \mathbb{A})$ over the set $\mathbb{C}_{\mathbb{D}}$ of linear constraints defined over $\mathbb{F}$ and the leaf abstract domain $\mathbb{A}$ defined over *Var* is either a leaf node $\ll a \gg$ with $a \in \mathbb{A}$, or $[\![ c : tl, tr ]\!]$, where $c \in \mathbb{C}_{\mathbb{D}}$ (denoted by $t.c$) is the smallest constraint with respect to $<_{\mathbb{C}_{\mathbb{D}}}$ appearing in the tree $t$, $tl$ (denoted by $t.l$) is the left subtree of $t$ representing its *true branch*, and $tr$ (denoted by $t.r$) is the right subtree of $t$ representing its *false branch*. The path along a decision tree establishes the set of configurations (those that satisfy the encountered constraints), and the leaf nodes represent the analysis properties for the corresponding configurations.

*Example 2.* The following two constraint-based decision trees t<sup>1</sup> and t<sup>2</sup> have decision nodes labelled with Interval linear constraints over the numeric feature SIZE with domain {1, 2, 3, 4}, whereas leaf nodes are Interval properties:

$$t\_1 = \left[ \text{SIZE} \ge 4 : \ll[y \ge 2] \gg , \ll[y = 0] \gg \right], \ t\_2 = \left[ \text{SIZE} \ge 2 : \ll[y \ge 0] \gg , \ll[y \le 0] \gg \right] \square$$

*Abstract Operations.* The *concretization function* γ<sub>T</sub> of a decision tree t ∈ T(C<sub>D</sub>, A) returns γ<sub>A</sub>(a) for k ∈ K, where k satisfies the set C ∈ P(C<sub>D</sub>) of constraints accumulated along the top-down path to the leaf node a ∈ A. More formally, γ<sub>T</sub>(t) = γ̄<sub>T</sub>[K](t). The function γ̄<sub>T</sub> accumulates into a set C ∈ P(C<sub>D</sub>) the constraints along the paths up to a leaf node. The set C is initially equal to the set of implicit constraints over F, K = ∨<sub>k∈K</sub> k, which takes into account the domains of features:

$$\overline{\gamma}\_{\mathbb{T}}[C](\ll a \gg) = \prod\_{k \models C} \gamma\_{\mathbb{A}}(a), \quad \overline{\gamma}\_{\mathbb{T}}[C]([\![c : tl, tr]\!]) = \overline{\gamma}\_{\mathbb{T}}[C \cup \{c\}](tl) \times \overline{\gamma}\_{\mathbb{T}}[C \cup \{\neg c\}](tr)$$

Note that k |= C is equivalent to α<sub>CD</sub>({k}) ⊑<sub>D</sub> α<sub>CD</sub>(C). Therefore, we can check k |= C using the abstract operation ⊑<sub>D</sub> of the numerical domain D.
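To make the path-based reading of a decision tree concrete, the following Python sketch maps each configuration to the leaf property reached along its path. It is only an illustration under simplifying assumptions: a single feature SIZE, decision constraints of the form SIZE ≥ bound, and interval leaves; the names `Leaf`, `Node`, `leaf_for`, and `per_configuration` are hypothetical. It returns abstract leaf properties per configuration (the tuple view of the tree) rather than the concrete states given by γ<sub>A</sub>.

```python
import math
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple, Union

Interval = Tuple[float, float]   # leaf property: closed interval for y

@dataclass(frozen=True)
class Leaf:
    prop: Interval

@dataclass(frozen=True)
class Node:
    bound: int        # decision constraint SIZE >= bound (assumed encoding)
    left: "Tree"      # true branch
    right: "Tree"     # false branch

Tree = Union[Leaf, Node]

def leaf_for(t: Tree, size: int) -> Interval:
    """Follow the path whose constraints the configuration satisfies."""
    while isinstance(t, Node):
        t = t.left if size >= t.bound else t.right
    return t.prop

def per_configuration(t: Tree, sizes: Iterable[int]) -> Dict[int, Interval]:
    """One leaf property per configuration, i.e. the tuple view of the tree."""
    return {k: leaf_for(t, k) for k in sizes}

# t1 from Example 2: [[SIZE >= 4 : <<y >= 2>>, <<y = 0>>]], SIZE in {1, 2, 3, 4}
t1 = Node(4, Leaf((2, math.inf)), Leaf((0, 0)))
view = per_configuration(t1, [1, 2, 3, 4])
print(view)
```

Configurations 1, 2, and 3 share the leaf y = 0, which is the sharing that the tree representation exploits.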

Other binary operations of T(C<sub>D</sub>, A) are based on Algorithm 1 for *tree unification*, which finds a common refinement (labelling) of two trees t<sub>1</sub> and t<sub>2</sub> by calling function UNIFICATION(t<sub>1</sub>, t<sub>2</sub>, K). It possibly adds new constraints as decision nodes (Lines 5–7, Lines 11–13), or removes constraints that are redundant (Lines 3,4,9,10,15,16). The function UNIFICATION accumulates into the set C ∈ P(C<sub>D</sub>) (initialized to K, which represents implicit constraints satisfied by both t<sub>1</sub> and t<sub>2</sub>) the constraints encountered along the paths of the decision tree. This set C is used by the function isRedundant(c, C), which checks whether the linear constraint c ∈ C<sub>D</sub> is redundant with respect to C by testing α<sub>CD</sub>(C) ⊑<sub>D</sub> α<sub>CD</sub>({c}). Note that tree unification does not lose any information.

*Example 3.* Consider the constraint-based decision trees t<sub>1</sub> and t<sub>2</sub> from Example 2. After tree unification UNIFICATION(t<sub>1</sub>, t<sub>2</sub>, K), the resulting decision trees are:

$$\begin{array}{l} t\_1 = [\mathsf{SIZE} \ge 4 : \ll[y \ge 2] \gg , [\mathsf{SIZE} \ge 2 : \ll[y = 0] \gg , \ll[y = 0] \gg ]], \\ t\_2 = [\mathsf{SIZE} \ge 4 : \ll[y \ge 0] \gg , [\mathsf{SIZE} \ge 2 : \ll[y \ge 0] \gg , \ll[y \le 0] \gg ]] \end{array}$$

Note that UNIFICATION adds a decision node for SIZE ≥ 2 to the right subtree of t<sub>1</sub>, whereas it adds a decision node for SIZE ≥ 4 to t<sub>2</sub> and removes the redundant constraint SIZE ≥ 2 from the resulting left subtree of t<sub>2</sub>.

All binary operations are performed leaf-wise on the unified decision trees. Given two unified decision trees t<sub>1</sub> and t<sub>2</sub>, their ordering and join are defined as:

$$\begin{aligned} \ll a\_1 \gg \sqsubseteq\_{\mathbb{T}} \ll a\_2 \gg &= a\_1 \sqsubseteq\_{\mathbb{A}} a\_2, & [\![c : tl\_1, tr\_1]\!] \sqsubseteq\_{\mathbb{T}} [\![c : tl\_2, tr\_2]\!] &= (tl\_1 \sqsubseteq\_{\mathbb{T}} tl\_2) \land (tr\_1 \sqsubseteq\_{\mathbb{T}} tr\_2) \\ \ll a\_1 \gg \sqcup\_{\mathbb{T}} \ll a\_2 \gg &= \ll a\_1 \sqcup\_{\mathbb{A}} a\_2 \gg, & [\![c : tl\_1, tr\_1]\!] \sqcup\_{\mathbb{T}} [\![c : tl\_2, tr\_2]\!] &= [\![c : tl\_1 \sqcup\_{\mathbb{T}} tl\_2, tr\_1 \sqcup\_{\mathbb{T}} tr\_2]\!] \end{aligned}$$

Similarly, we compute meet, widening, and narrowing of t<sub>1</sub> and t<sub>2</sub>. The top is a tree with a single ⊤<sub>A</sub> leaf: ⊤<sub>T</sub> = ≪⊤<sub>A</sub>≫, while the bottom is: ⊥<sub>T</sub> = ≪⊥<sub>A</sub>≫.

*Example 4.* Consider the unified trees t<sub>1</sub> and t<sub>2</sub> from Example 3. We have that t<sub>1</sub> ⊑<sub>T</sub> t<sub>2</sub> holds, and t<sub>1</sub> ⊔<sub>T</sub> t<sub>2</sub> = [[SIZE ≥ 4 : ≪[y ≥ 0]≫, [[SIZE ≥ 2 : ≪[y ≥ 0]≫, ≪[y ≤ 0]≫]]]].
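The leaf-wise ordering and join on unified trees can be sketched in Python as follows. This is a simplified illustration, assuming that both trees already have the same shape, that leaves are non-empty intervals given as (low, high) pairs, and that decision nodes carry an opaque label; the helpers `leaf`, `node`, `leq`, and `join` are hypothetical names.

```python
import math

INF = math.inf

def leaf(lo, hi):
    return ("leaf", (lo, hi))

def node(c, l, r):
    return ("node", c, l, r)

def leq(t1, t2):
    """Ordering on unified trees: interval inclusion at every pair of leaves."""
    if t1[0] == "leaf":
        (l1, h1), (l2, h2) = t1[1], t2[1]
        return l2 <= l1 and h1 <= h2
    return leq(t1[2], t2[2]) and leq(t1[3], t2[3])

def join(t1, t2):
    """Join on unified trees: interval hull at every pair of leaves."""
    if t1[0] == "leaf":
        (l1, h1), (l2, h2) = t1[1], t2[1]
        return leaf(min(l1, l2), max(h1, h2))
    return node(t1[1], join(t1[2], t2[2]), join(t1[3], t2[3]))

# Unified trees from Example 3.
t1 = node("SIZE>=4", leaf(2, INF), node("SIZE>=2", leaf(0, 0), leaf(0, 0)))
t2 = node("SIZE>=4", leaf(0, INF), node("SIZE>=2", leaf(0, INF), leaf(-INF, 0)))

print(leq(t1, t2))         # True
print(join(t1, t2) == t2)  # True
```

Both outputs agree with Example 4: t<sub>1</sub> is below t<sub>2</sub>, so their join is t<sub>2</sub> itself.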

#### **Algorithm 1:** UNIFICATION(t1, t2, C)

1 **if** isLeaf(t1) ∧ isLeaf(t2) **then return** (t1, t2);
2 **if** isLeaf(t1) ∨ (isNode(t1) ∧ isNode(t2) ∧ t2.c <<sub>CD</sub> t1.c) **then**
3   **if** isRedundant(t2.c, C) **then return** UNIFICATION(t1, t2.l, C);
4   **if** isRedundant(¬t2.c, C) **then return** UNIFICATION(t1, t2.r, C);
5   (l1, l2) = UNIFICATION(t1, t2.l, C ∪ {t2.c});
6   (r1, r2) = UNIFICATION(t1, t2.r, C ∪ {¬t2.c});
7   **return** ([[t2.c : l1, r1]], [[t2.c : l2, r2]]);
8 **if** isLeaf(t2) ∨ (isNode(t1) ∧ isNode(t2) ∧ t1.c <<sub>CD</sub> t2.c) **then**
9   **if** isRedundant(t1.c, C) **then return** UNIFICATION(t1.l, t2, C);
10   **if** isRedundant(¬t1.c, C) **then return** UNIFICATION(t1.r, t2, C);
11   (l1, l2) = UNIFICATION(t1.l, t2, C ∪ {t1.c});
12   (r1, r2) = UNIFICATION(t1.r, t2, C ∪ {¬t1.c});
13   **return** ([[t1.c : l1, r1]], [[t1.c : l2, r2]]);
14 **else**
15   **if** isRedundant(t1.c, C) **then return** UNIFICATION(t1.l, t2.l, C);
16   **if** isRedundant(¬t1.c, C) **then return** UNIFICATION(t1.r, t2.r, C);
17   (l1, l2) = UNIFICATION(t1.l, t2.l, C ∪ {t1.c});
18   (r1, r2) = UNIFICATION(t1.r, t2.r, C ∪ {¬t1.c});
19   **return** ([[t1.c : l1, r1]], [[t1.c : l2, r2]]);
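The structure of Algorithm 1 can be followed in a small Python sketch. Several simplifications are assumed: there is a single feature SIZE with the finite domain {1, 2, 3, 4}; a constraint a·SIZE + b ≥ 0 is encoded as the pair (a, b); leaves carry opaque labels; and redundancy is decided by enumerating configurations, which stands in for the abstract test α<sub>CD</sub>(C) ⊑<sub>D</sub> α<sub>CD</sub>({c}). The implicit constraints K are modelled by the list `K`, so the accumulated set C starts empty. All function names are hypothetical.

```python
K = [1, 2, 3, 4]                  # assumed finite domain of the feature SIZE

def holds(c, k):
    """c = (a, b) encodes the constraint a*SIZE + b >= 0."""
    return c[0] * k + c[1] >= 0

def neg(c):
    return (-c[0], -c[1] - 1)

def is_redundant(c, C):
    """Every configuration that satisfies C also satisfies c (by enumeration)."""
    return all(holds(c, k) for k in K if all(holds(d, k) for d in C))

def is_leaf(t):
    return t[0] == "leaf"

def unify(t1, t2, C):
    if is_leaf(t1) and is_leaf(t2):
        return t1, t2
    if is_leaf(t1) or (not is_leaf(t2) and t2[1] < t1[1]):
        # t2 has the smaller root constraint: split on it, t1 is kept whole
        c, a_l, a_r, b_l, b_r = t2[1], t1, t1, t2[2], t2[3]
    elif is_leaf(t2) or t1[1] < t2[1]:
        # t1 has the smaller root constraint: split on it, t2 is kept whole
        c, a_l, a_r, b_l, b_r = t1[1], t1[2], t1[3], t2, t2
    else:
        # equal root constraints: descend into both trees
        c, a_l, a_r, b_l, b_r = t1[1], t1[2], t1[3], t2[2], t2[3]
    if is_redundant(c, C):        # only the true branch is reachable
        return unify(a_l, b_l, C)
    if is_redundant(neg(c), C):   # only the false branch is reachable
        return unify(a_r, b_r, C)
    l1, l2 = unify(a_l, b_l, C + [c])
    r1, r2 = unify(a_r, b_r, C + [neg(c)])
    return ("node", c, l1, r1), ("node", c, l2, r2)

# Trees from Example 2: SIZE >= 4 is (1, -4) and SIZE >= 2 is (1, -2).
t1 = ("node", (1, -4), ("leaf", "y>=2"), ("leaf", "y=0"))
t2 = ("node", (1, -2), ("leaf", "y>=0"), ("leaf", "y<=0"))
u1, u2 = unify(t1, t2, [])
print(u1)
print(u2)
```

With Python's tuple order, (1, −4) is smaller than (1, −2), so SIZE ≥ 4 becomes the root of both results, and the two output trees have the shape reported in Example 3.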


*Transfer functions.* The transfer functions for forward assignments (ASSIGN<sub>T</sub>) and expression-based tests (FILTER<sub>T</sub>) modify only leaf nodes of a constraint-based decision tree. In contrast, transfer functions for variability-specific constructs, such as feature-based tests (FEAT-FILTER<sub>T</sub>) and #if-s (IFDEF<sub>T</sub>), add, modify, or delete decision nodes of a decision tree. This is due to the fact that the analysis information about program variables is located in leaf nodes, while the information about feature variables is located in decision nodes.

Transfer function ASSIGN<sub>T</sub> for handling an assignment x:=e in the input tree t is described by Algorithm 2. Note that x ∈ *Var*, and e ∈ *Exp* may contain only program variables. We apply ASSIGN<sub>A</sub> to each leaf node a of t, which substitutes expression e for variable x in a. Similarly, transfer function FILTER<sub>T</sub> for handling expression-based tests e ∈ *Exp* is implemented by applying FILTER<sub>A</sub> leaf-wise.

Transfer function FEAT-FILTER<sub>T</sub> for feature-based tests θ is described by Algorithm 3. It reasons by induction on the structure of θ (we assume negation is applied to atomic propositions). When θ is an atomic constraint over numerical features (Lines 2,3), we use FILTER<sub>D</sub> to approximate θ, thus producing a set of constraints J, which are then added to the tree t, possibly discarding all paths of t that do not satisfy θ. This is done by calling function RESTRICT(t, K, J), which adds linear constraints from J to t in ascending order with respect to <<sub>CD</sub>, as shown in Algorithm 4. Note that θ may not be representable exactly in C<sub>D</sub> (e.g., in the case of non-linear constraints over F), so FILTER<sub>D</sub> may produce a set of constraints approximating it. When θ is a conjunction (resp., disjunction) of two feature expressions (Lines 4,5) (resp., Lines 6,7), the resulting decision trees are merged by the meet ⊓<sub>T</sub> (resp., join ⊔<sub>T</sub>).

#### **Algorithm 3:** FEAT-FILTER<sub>T</sub>(t, θ)

1 **switch** θ **do**
2   **case** (e<sub>FZ</sub> ⋈ e<sub>FZ</sub>) || (¬(e<sub>FZ</sub> ⋈ e<sub>FZ</sub>)) **do**
3     J = FILTER<sub>D</sub>(⊤<sub>D</sub>, θ); **return** RESTRICT(t, K, J)
4   **case** θ<sub>1</sub> ∧ θ<sub>2</sub> **do**
5     **return** FEAT-FILTER<sub>T</sub>(t, θ<sub>1</sub>) ⊓<sub>T</sub> FEAT-FILTER<sub>T</sub>(t, θ<sub>2</sub>)
6   **case** θ<sub>1</sub> ∨ θ<sub>2</sub> **do**
7     **return** FEAT-FILTER<sub>T</sub>(t, θ<sub>1</sub>) ⊔<sub>T</sub> FEAT-FILTER<sub>T</sub>(t, θ<sub>2</sub>)

Function RESTRICT(t, C, J), described in Algorithm 4, takes as input a decision tree t, a set C of linear constraints accumulated along paths up to a node, and a set J of linear constraints in canonical form that need to be added to t. For each constraint j ∈ J, there exists a boolean b<sub>j</sub> that shows whether the tree should be constrained with respect to j or with respect to ¬j. When J is not empty, the linear constraints from J are added to t in ascending order with respect to <<sub>CD</sub>. At each iteration, the smallest linear constraint j is extracted from J (Line 9), and is handled based on whether j is smaller than (Lines 11–15), or greater than or equal to (Lines 17–21), the constraint at the node of t we currently consider.

Finally, transfer function IFDEF<sup>T</sup> is defined as:

IFDEF<sub>T</sub>(t, #if (θ) s) = [[s]]<sub>T</sub>(FEAT-FILTER<sub>T</sub>(t, θ)) ⊔<sub>T</sub> FEAT-FILTER<sub>T</sub>(t, ¬θ)

where [[s]]T(t) denotes the transfer function in T(CD, A) for statement s.

After applying transfer functions, the obtained decision trees may contain some redundancy that can be exploited to further compress them. Function COMPRESS<sub>T</sub>(t, C), described by Algorithm 5, is applied to decision trees t in order to compress (reduce) their representation. We use five different optimizations. First, if the constraints on a path to some leaf are unsatisfiable, we eliminate that leaf node (Lines 9,10). Second, if a decision node contains two identical subtrees, then we keep only one subtree and also eliminate the decision node (Lines 11–13). Third, if a decision node contains a left leaf and a right subtree, such that its left leaf is the same as the left leaf of its right subtree and the constraint in the decision node is less than or equal to the constraint in the root of its right subtree, then we can eliminate the decision node and its left leaf (Lines 14,15). A similar rule exists when a decision node has a left subtree and a right leaf (Lines 16,17).
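The first two optimizations can be sketched in Python as follows. This is a simplified variant rather than a transcription of Algorithm 5: it applies the unsatisfiability check to arbitrary subtrees, it omits the rules of Lines 14–17, and it decides satisfiability by enumerating an assumed finite domain {1, 2, 3, 4} of a single feature SIZE. The encoding of a constraint a·SIZE + b ≥ 0 as the pair (a, b) and all function names are hypothetical.

```python
K = [1, 2, 3, 4]                  # assumed finite domain of the feature SIZE

def holds(c, k):
    """c = (a, b) encodes the constraint a*SIZE + b >= 0."""
    return c[0] * k + c[1] >= 0

def neg(c):
    return (-c[0], -c[1] - 1)

def unsat(C):
    """No configuration in K satisfies all constraints in C."""
    return not any(all(holds(d, k) for d in C) for k in K)

def compress(t, C=()):
    if t[0] == "leaf":
        return t
    _, c, l, r = t
    l2 = compress(l, C + (c,))
    r2 = compress(r, C + (neg(c),))
    if unsat(C + (c,)):           # true branch unreachable: keep false branch
        return r2
    if unsat(C + (neg(c),)):      # false branch unreachable: keep true branch
        return l2
    if l2 == r2:                  # identical subtrees: drop the decision node
        return l2
    return ("node", c, l2, r2)

# Unified t1 from Example 3, whose inner node has two identical leaves.
t = ("node", (1, -4), ("leaf", "y>=2"),
     ("node", (1, -2), ("leaf", "y=0"), ("leaf", "y=0")))
print(compress(t))
```

On this input the inner node for SIZE ≥ 2 is removed, which gives back the tree t<sub>1</sub> of Example 2.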

#### **Algorithm 4:** RESTRICT(t, C, J)

```
 1 if isEmpty(J) then
 2   if isLeaf(t) then return t;
 3   if isRedundant(t.c, C) then return RESTRICT(t.l, C, J);
 4   if isRedundant(¬t.c, C) then return RESTRICT(t.r, C, J);
 5   l = RESTRICT(t.l, C ∪ {t.c}, J);
 6   r = RESTRICT(t.r, C ∪ {¬t.c}, J);
 7   return ([[t.c : l, r]]);
 8 else
 9   j = min<CD(J);
10   if isLeaf(t) ∨ (isNode(t) ∧ j <CD t.c) then
11     if isRedundant(j, C) then return RESTRICT(t, C, J\{j});
12     if isRedundant(¬j, C) then return ≪⊥A≫;
13     if j =CD t.c then (if bj then t = t.l; else t = t.r);
14     if bj then return ([[j : RESTRICT(t, C ∪ {j}, J\{j}), ≪⊥A≫]]);
15     else return ([[j : ≪⊥A≫, RESTRICT(t, C ∪ {j}, J\{j})]]);
16   else
17     if isRedundant(t.c, C) then return RESTRICT(t.l, C, J);
18     if isRedundant(¬t.c, C) then return RESTRICT(t.r, C, J);
19     l = RESTRICT(t.l, C ∪ {t.c}, J);
20     r = RESTRICT(t.r, C ∪ {¬t.c}, J);
21     return ([[t.c : l, r]]);
```

*Lifted analysis.* The abstract operations and transfer functions of T(C<sub>D</sub>, A) can be used to define the lifted analysis for program families. The tree t<sub>in</sub> at the initial location has only one leaf node ⊤<sub>A</sub> and decision nodes that define the set K. Note that if K ≡ true, then t<sub>in</sub> = ⊤<sub>T</sub>. In this way, we collect the possible invariants in the form of decision trees at all program locations.

We establish correctness of the lifted analysis based on T(C<sub>D</sub>, A) by showing that it produces results identical to those of the tuple-based domain A<sup>K</sup>. Let [[s]]<sub>T</sub> and [[s]] denote transfer functions of statement s in T(C<sub>D</sub>, A) and A<sup>K</sup>, respectively. Recall that a<sub>in</sub> = ∏<sub>k∈K</sub> ⊤<sub>A</sub>, and so γ<sub>T</sub>(t<sub>in</sub>) = γ(a<sub>in</sub>).

**Theorem 1.** γ<sub>T</sub>([[s]]<sub>T</sub>(t<sub>in</sub>)) = γ([[s]](a<sub>in</sub>))*.*

*Proof.* The proof is by induction on the structure of s. We consider the most interesting case: #if (θ) s #endif. Transfer functions for #if are identical in both lifted domains. We only need to show that FEAT-FILTER(a, θ) and FEAT-FILTER<sub>T</sub>(t, θ) are identical. This is shown by induction on θ [13].

*Example 5.* Let us consider the code base of a program family P given in Fig. 4. It contains only one numerical feature SIZE with domain ℕ. The decision tree inferred at the final location ④ is depicted in Fig. 5. It uses the Interval domain for both decision and leaf nodes. Note that the constraint (SIZE < 3) does not explicitly appear in the code base, but we obtain it in the decision tree representation. This shows that the partitioning of the configuration space K induced by decision trees is semantics-based rather than syntax-based.

#### **Algorithm 5:** COMPRESST(t, C**)**

```
 1 switch t do
 2   case ≪n≫ do
 3     return ≪n≫;
 4   case [[t.c : l, r]] do
 5     l′ = COMPRESST(t.l, C ∪ {t.c});
 6     r′ = COMPRESST(t.r, C ∪ {¬t.c});
 7     switch l′, r′ do
 8       case ≪n′l≫, ≪n′r≫ do
 9         if UNSAT(C ∪ {t.c}) then return ≪n′r≫;
10         if UNSAT(C ∪ {¬t.c}) then return ≪n′l≫;
11         if n′l = n′r then return ≪n′l≫;
12       case [[c1 : l1, r1]], [[c2 : l2, r2]] when c1 = c2 ∧ l1 = l2 ∧ r1 = r2 do
13         return [[c1 : l1, r1]];
14       case ≪n′l≫, [[c2 : l2, r2]] when n′l = l2 ∧ c ≤CD c2 do
15         return [[c2 : l2, r2]];
16       case [[c1 : l1, r1]], ≪n′r≫ when n′r = r1 ∧ c1 ≤CD c do
17         return [[c1 : l1, r1]];
18       case default: do
19         return [[t.c : l′, r′]];
```

> ① int x := 0;
> ② #if (SIZE ≤ 4) x := x+1; #else x := x-1; #endif
> ③ #if (SIZE==3 || SIZE==4) x := x-2; #endif
> ④

Fig. 4: Code base for program family P.

Fig. 5: Decision tree at loc. ④ of P.
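The partitioning discussed in Example 5 can be checked by brute force. The Python sketch below is not the lifted analysis; it simply executes each variant of the code base in Fig. 4 for a finite range of SIZE values (the range 0 to 9 is an assumption made for the illustration, since the domain ℕ is infinite) and groups the configurations by the final value of x. The names `run_variant` and `partition` are hypothetical.

```python
def run_variant(size: int) -> int:
    """Execute the variant of the Fig. 4 code base selected by SIZE = size."""
    x = 0
    if size <= 4:                  # #if (SIZE <= 4)
        x = x + 1
    else:
        x = x - 1
    if size == 3 or size == 4:     # #if (SIZE==3 || SIZE==4)
        x = x - 2
    return x

def partition(sizes):
    """Group configurations by the value of x at the final location."""
    groups = {}
    for s in sizes:
        groups.setdefault(run_variant(s), []).append(s)
    return groups

groups = partition(range(10))
print(groups)   # {1: [0, 1, 2], -1: [3, 4, 5, 6, 7, 8, 9]}
```

The two groups are separated exactly by SIZE < 3, a constraint that does not occur in any #if condition of the code base.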

*Example 6.* Let us consider the code base of a program family P given in Fig. 6. It contains one numerical feature A with domain [1, 4] and a non-linear feature expression A ∗ A < 9. At program location ②, FEAT-FILTER<sub>T</sub>(≪x = 0≫, A ∗ A < 9) returns an over-approximating tree ≪x = 0≫, whereas FEAT-FILTER<sub>T</sub>(≪x = 0≫, ¬(A ∗ A < 9)) returns [[A ≥ 3 : ≪x = 0≫, ≪⊥<sub>I</sub>≫]]. In effect, we obtain an over-approximating result at the final program location ③, as shown in Fig. 7. The precise result at program location ③, which can be obtained if we have numerical domains that can handle non-linear constraints, is given in Fig. 8. We observe that when ¬(A ≤ 2), we obtain an over-approximating analysis result (−1 ≤ x ≤ 1 instead of x = −1) due to the over-approximation of the non-linear feature expression in the numerical domains we use.

Fig. 6: Code base for P.

Fig. 7: Over-approximating decision tree at loc. ③ of P.

Fig. 8: Precise decision tree at loc. ③ of P.

#### **6 Evaluation**

*Implementation* We have developed a prototype lifted static analyzer, called SPLNum2Analyzer, that uses lifted abstract domains of tuples A<sup>K</sup> and decision trees T(C<sub>D</sub>, A). The abstract domains A for encoding properties of tuple components and leaf nodes, as well as the abstract domain D for encoding linear constraints over numerical features, are based on the intervals, octagons, and polyhedra domains. Their abstract operations and transfer functions are provided by the APRON library [19]. Our proof-of-concept implementation is written in OCaml and consists of around 6K lines of code. The current front-end of the tool accepts programs written in a subset of C with #if directives, but without struct and union types. It currently provides only limited support for arrays, pointers, and recursion. The only basic data type is mathematical integers. SPLNum2Analyzer automatically infers numerical invariants in all program locations corresponding to all variants in the given family. We use delayed widening and narrowing [7,24] to improve the precision of while-s.

*Experimental setup and Benchmarks* All experiments are executed on a 64-bit Intel Core™ i7-8700 CPU @ 3.20GHz × 12, Ubuntu 18.04.5 LTS, with 8 GB memory, and we use a timeout value of 300 sec. All times are reported as the average over five independent executions. The implementation, benchmarks, and all results obtained from our experiments are available from: https://github.com/aleksdimovski/SPLNUM2Analyzer. In our experiments, we use three instances of our lifted analysis via tuples: A<sub>Π</sub>(I), A<sub>Π</sub>(O), and A<sub>Π</sub>(P), and via decision trees: A<sub>T</sub>(I), A<sub>T</sub>(O), and A<sub>T</sub>(P), which use intervals, octagons, and polyhedra domains as parameters, respectively.

SPLNum2Analyzer was evaluated on a dozen C programs collected from several categories of the 8th International Competition on Software Verification (SV-COMP 2019, https://sv-comp.sosy-lab.org/2019/): loops, loop-invgen (invgen for short), loop-lit (lit), termination-crafted (crafted); as well as from the real-world BusyBox project (https://busybox.net). In the case of SV-COMP, we have first selected some numerical programs with integers, and then we have manually added variability (features and #if directives) in each of them. In the case of BusyBox, we have first selected some programs with numerical features, and then we have simplified those programs so that our tool can handle them. For example, any reference to a pointer or a library function is replaced with [−∞, +∞]. Table 1 presents characteristics of the benchmarks. We list: the file name (Benchmark), the category (folder), the number of features and configurations (|F|, |K|), and lines of code (LOC).

Table 1: Performance results for lifted static analyses based on decision trees vs. tuples (which are used as baseline). All times are in seconds.

*Performance Results* Table 1 shows the results of analyzing our benchmark files by using different versions of our lifted static analyses based on decision trees and on tuples. For each version of decision tree-based lifted analysis, there are two columns. In the first column, Time, we report the running time in seconds to analyze the given benchmark using the corresponding version of lifted analysis based on decision trees. In the second column, Impr., we report the speed-up factor for each version of lifted analysis based on decision trees relative to the corresponding baseline lifted analysis based on tuples (A<sub>T</sub>(I) vs. A<sub>Π</sub>(I), A<sub>T</sub>(O) vs. A<sub>Π</sub>(O), and A<sub>T</sub>(P) vs. A<sub>Π</sub>(P)). The performance results confirm that sharing is indeed effective, and especially so for large values of |K|. On our benchmarks, it translates to speed-ups (i.e., A<sub>T</sub>(−) vs. A<sub>Π</sub>(−)) that range from 1.1 to 4.6 times when |K| < 100, and from 3.7 to 32 times when |K| > 100.

*Computational tractability* The tuple-based lifted analysis A<sub>Π</sub>(−) may become very slow or even infeasible for very large configuration spaces |K|. We have tested the limits of A<sub>Π</sub>(P) and A<sub>T</sub>(−). We took a method, test<sup>k</sup><sub>n</sub>(), which contains n numerical features A<sub>1</sub>, ..., A<sub>n</sub>, such that each numerical feature A<sub>i</sub> has domain dom(A<sub>i</sub>) = [0, k − 1] = {0, ..., k − 1}. The body of test<sup>k</sup><sub>n</sub>() consists of n sequentially composed #if-s of the form #if (A<sub>i</sub> = 0) i := i+1 #else i := 0 #endif. For example, test<sup>3</sup><sub>2</sub>() with two features A<sub>1</sub> and A<sub>2</sub>, whose domain is [0, 2], is:

> ① int i := 0;
> ② #if (A<sub>1</sub> = 0) i := i+1 #else i := 0 #endif
> ③ #if (A<sub>2</sub> = 0) i := i+1 #else i := 0 #endif
> ④

Fig. 9: A<sub>Π</sub>(P) results at ④ of test<sup>3</sup><sub>2</sub>().

Fig. 10: A<sub>T</sub>(P) results at ④ of test<sup>3</sup><sub>2</sub>().

Depending on the chosen configuration, the variable i in location ④ can have a value in the range from 2, when A<sub>1</sub> and A<sub>2</sub> are both assigned 0, to 0, when A<sub>2</sub> ≥ 1. The analysis results in location ④ of test<sup>3</sup><sub>2</sub>() obtained using A<sub>Π</sub>(P) and A<sub>T</sub>(P) are shown in Fig. 9 and Fig. 10, respectively. A<sub>Π</sub>(P) uses tuples with 9 interval properties (components), while A<sub>T</sub>(P) uses 3 interval properties (leaves).

Table 2: The performance results of analyzing test<sup>k</sup><sub>n</sub>.


We have generated methods test<sup>k</sup><sub>n</sub>() by gradually increasing variability. In general, the size of tuples used by A<sub>Π</sub>(P) is k<sup>n</sup>, whereas the number of leaf nodes in decision trees used by A<sub>T</sub>(P) in the final program location is n + 1. The performance results of analyzing test<sup>k</sup><sub>n</sub>, for different values of n and k, using A<sub>Π</sub>(P) and A<sub>T</sub>(P) are shown in Table 2. In the columns Impr., we report the speed-up of A<sub>T</sub>(P) with respect to A<sub>Π</sub>(P). We observe that A<sub>T</sub>(P) yields decision trees that provide a quite compact and symbolic representation of lifted analysis results. Since the configurations with equivalent analysis results are nicely encoded using linear constraints in decision nodes, the performance of A<sub>T</sub>(P) does not depend on k, but only on n. On the other hand, the performance of A<sub>Π</sub>(P) heavily depends on k. Thus, within a timeout limit of 300 seconds, the analysis A<sub>Π</sub>(P) fails to terminate for test<sup>3</sup><sub>11</sub>, test<sup>5</sup><sub>8</sub>, and test<sup>7</sup><sub>6</sub>. In summary, we can conclude that decision trees A<sub>T</sub>(P) can not only greatly speed up lifted analyses, but also turn previously infeasible analyses into feasible ones.
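The gap between k<sup>n</sup> tuple components and n + 1 leaves can be reproduced with a short Python sketch. It enumerates all configurations of the synthetic method and counts the distinct final values of i; it does not run either lifted analysis. The function names `run_test` and `sizes` are hypothetical.

```python
from itertools import product

def run_test(config):
    """Final value of i for one configuration (a_1, ..., a_n) of the method."""
    i = 0
    for a in config:
        # models: #if (A_j = 0) i := i+1 #else i := 0 #endif
        i = i + 1 if a == 0 else 0
    return i

def sizes(n, k):
    """Number of configurations and number of distinct final values of i."""
    configs = list(product(range(k), repeat=n))
    return len(configs), len({run_test(c) for c in configs})

print(sizes(2, 3))   # (9, 3)
print(sizes(3, 5))   # (125, 4)
```

For n = 2 and k = 3 this gives 9 configurations but only 3 distinct results, in line with the 9 tuple components and 3 leaves reported for Fig. 9 and Fig. 10.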

#### **7 Related Work**

*Decision-tree abstract domains* have been successfully used in the field of abstract interpretation recently [18,9,4,26]. Decision trees have been applied for the disjunctive refinement of the Interval domain [18]. That is, each element of the new domain is a propositional formula over interval linear constraints. Segmented decision tree abstract domains have also been defined [9,4] to enable path-dependent static analysis. Their elements contain decision nodes that are determined either by values of program variables [9] or by the branch (if) conditions [4], whereas the leaf nodes are numerical properties. Urban and Miné [26] use decision tree-based abstract domains to prove program termination. Decision nodes are labelled with linear constraints that split the memory space and leaf nodes contain affine ranking functions for proving program termination.

Recently, two main styles of static analysis have been a topic of considerable research in the SPL community: *a dataflow analysis from the monotone framework* developed by Kildall [21] that is algorithmically defined on syntactic CFGs, and *an abstract interpretation-based static analysis* developed by Cousot and Cousot [7] that is more general and semantically defined. Brabrand et al. [3] lift a dataflow analysis from the *monotone framework*, resulting in a tuple-based lifted dataflow analysis. Another efficient implementation of the lifted dataflow analysis from the monotone framework is based on using variational data structures [27]. Midtgaard et al. [22] have proposed a formal methodology for systematic derivation of tuple-based lifted static analyses in the *abstract interpretation framework*. A more efficient lifted static analysis by abstract interpretation, obtained by improving the representation via BDD domains, is given in [11]. Another approach to speed up lifted analyses is to use so-called variability abstractions [14,15], which are used to derive abstract lifted analyses. They tame the combinatorial explosion of the number of configurations and reduce it to something more tractable by manipulating the configuration space. The work [5] presents a model checking technique to analyze probabilistic program families.

#### **8 Conclusion**

In this work, we employ decision trees and widely known numerical abstract domains for automatic inference of invariants in all locations of C program families that contain numerical features. In the future, we would like to extend the lifted abstract domain to also support non-linear constraints [17]. An interesting direction for future work would be to explore possibilities of applying variability abstractions [14] as yet another way to speed up lifted analyses. We can also define a backward lifted analysis in combination with a preliminary forward lifted analysis to infer the necessary preconditions for a given assertion to be satisfied or violated. The obtained preconditions in the form of linear constraints can be analyzed using model counting techniques to quantify how likely an input or a variant is to satisfy them [16,12].

#### **References**



#### Finding a Universal Execution Strategy for Model Transformation Networks

Joshua Gleitze, Heiko Klare (✉), and Erik Burger

KASTEL, Karlsruhe Institute of Technology, Karlsruhe, Germany joshua.gleitze@student.kit.edu, klare@kit.edu, burger@kit.edu

Abstract. When using multiple models to describe a (software) system, one can use a network of model transformations to keep the models consistent after changes. No strategy exists, however, to orchestrate the execution of transformations if the network has an arbitrary topology. In this paper, we analyse how often and in which order transformations need to be executed. We argue why linear execution bounds are too restrictive to be useful in practice and prove that there is no upper bound for the number of necessary executions. To avoid non-termination, we propose a conservative strategy that makes execution failures easier to understand. These insights help developers and users of transformation networks to understand under which circumstances their networks can terminate. Additionally, the proposed strategy helps them to find the cause when a network cannot restore consistency.

Keywords: model consistency · model transformation networks

#### 1 Introduction

When modelling systems, one is often confronted with the task of *model consistency*: Since model-driven development aims at separating concerns by tailoring models to the needs of the people working on the system, there are typically different models, each one capturing the parts of the system that are relevant to the model's target audience. All those models taken together should describe a coherent system and not contain contradictory information. We say that the models should be consistent. Automatic detection and resolution of inconsistencies is, however, still poorly addressed in current development processes [12].

This work was supported by funding of the Helmholtz Association (HGF) through the Competence Center for Applied Security Technology (KASTEL).

© The Author(s) 2021

E. Guerra and M. Stoelinga (Eds.): FASE 2021, LNCS 12649, pp. 87–107, 2021. https://doi.org/10.1007/978-3-030-71500-7\_5

There are different means of maintaining consistency. A popular one is to define *incremental model transformations*, which update models based on information that was changed in one of them. While there has been significant research on model transformations themselves, particularly on binary transformations, maintaining consistency of multiple models is less researched [2]. There are approaches for multiary model transformations which can transform between multiple models by means of a single transformation. Nevertheless, one will likely also want to be able to combine multiple transformations (binary or multiary) to maintain consistency, creating a *transformation network*. Unlike using a single, overarching transformation, defining a network makes it possible to reuse modular ones. Additionally, knowledge about consistency between certain types of models is often distributed across domain experts [13]. This can be accommodated by transformation networks, because every domain expert can define transformations independently and according to their view on consistency.

To the best of the authors' knowledge, no strategy that determines an execution order of transformations to maintain consistency in a network with arbitrary topology has been presented yet. Existing work proposes, for example, defining an execution order explicitly [23, 35] or deriving a topological order [30]. Most approaches restrict the supported kinds of network topologies to such in which each transformation only needs to be executed once.

In this paper, we research properties and limitations of a universal strategy that executes a transformation network of arbitrary topology. We show that strategies that apply each transformation only once are not useful in practice. At the other end of the spectrum, we prove that not limiting the number of transformation executions does, in general, lead to non-termination. Based on the insight that a universal strategy can only operate conservatively, we derive a practicable strategy. In detail, we make the following contributions:


The contributions establish fundamental knowledge about the design space of network execution strategies, their undecidability, and difficulties in reducing conservativeness. The proposed strategy helps transformation network developers and users to find the reasons when an execution does not yield consistent models.

### 2 Problem Statement

In this section, we will further motivate our research by giving an example and clarifying its context. We provide a formalisation for transformation networks and execution strategies to generate a common understanding and formal basis for transformation network orchestration, constituting contribution *C1*.

### 2.1 Motivating Example

Figure 1 depicts a software project whose contributors take the roles of architects, developers and user experience (UX) designers. One person can take multiple roles, but every role has a particular view on the project and uses related tools. Architects use a UML-based tool to analyse and plan the architecture. Developers program the software in Java. These two models overlap: Although they cannot be derived completely from each other, the implementation should follow the architecture and architects want to see how code changes affect the architecture.

Fig. 1. Example for a transformation network in model-driven (software) development.

UX designers develop the UI for the software. Their designs overlap with the UML model, because, first, the software's requirements mandate certain properties of the UI, and, second, the architecture may restrict which information can be shown at which point in the interface. The UI design also overlaps with the code, since static parts of the UI can be derived from the UI model. Ideally, changes in the UI code can even be propagated back into the UI model.

The developers use OpenAPI™ [32] to exchange specifications of HTTP APIs. These specifications overlap with the parsing and serialisation code. Architects want to analyse how their architecture choices influence performance, using the Palladio Component Model (PCM) [24]. The architecture specification used in the PCM overlaps with the one defined in UML. Additionally, the PCM model contains information about performance properties and the deployment structure, which can partially be derived from the code.

Those relations can be encoded in transformations to avoid re-specification of similar information, such as the architecture in PCM and UML, to derive information, like appropriate Java stubs from OpenAPI specifications, and to preserve information consistency. Figure 1 shows the resulting transformation network. In this paper, we will find an execution strategy for such transformations, which is needed to correctly propagate changes from one model to the others.

#### 2.2 Context

We discuss model transformation networks in a specific usage context. We assume that different roles are involved in a development project, each using some models to describe their view of the system. The models are kept consistent by model transformations. For the sake of simplicity, we only discuss *binary* transformations between two models. To foster independent specification and reuse of transformations, we assume that they are not tailor-made, but may be general-purpose. As a consequence, we cannot assume that the models or transformations are or can be aligned, for example, to ensure that their execution in a specific order always results in consistent models. Neither can we assume that the network has a certain topology. We do, however, assume that all transformations are in accordance with a well-defined overall notion of consistency (reaching a consistent state would be impossible otherwise). This means that all requirements we pose on the transformations must only concern a transformation itself. A requirement like "no transformation overwrites the result of another" would not fit our context.

We require that transformations are *synchronising* [4], i.e., that they can deal with the situation that both of their models have been changed. This is essential to find an execution strategy: When propagating changes in a transformation network that contains cycles, it will inevitably happen that both models that are connected by a transformation will be changed. In contrast, the well-researched *bidirectional* transformations only change one of the models [28] and could in such a situation be forced to overwrite changes to yield a consistent result. The assumption of synchronising transformations also enables concurrent modifications by different project members.

#### 2.3 Formalisation

We are not concerned with how models are structured, so we simply resort to defining a universe $\mathbb{M}$ that contains all models. First, we define the kind of transformations that we use:

Definition 1. *A* synchronising binary transformation (syncx) $\tilde{t}$ *is a function that updates two models:*

$$\tilde{t} \colon \left(\mathbb{M} \times \mathbb{M}\right) \to \left(\mathbb{M} \times \mathbb{M}\right)$$

*A syncx' image consists of fixed points:*

$$\forall a \in \mathbb{M} \,\forall b \in \mathbb{M}: \tilde{t}(\tilde{t}(a,b)) = \tilde{t}(a,b)$$

*The universe of all syncx for* $\mathbb{M}$ *is called* $\mathbb{T}$*.*

This formalisation is a simplification sufficient for the purposes of this paper. In practice, transformations will, for example, be allowed to indicate an error instead of being required to always produce appropriate new models.

In comparison to existing formalisms [28], there is no consistency relation in the definition of a syncx. For our purposes, the consistency relation is not part of a syncx, but rather encoded implicitly in the syncx' behaviour. We assume that the transformations are correct and hippocratic [28] with regard to their implicit consistency relation and can then recover the relation:

Definition 2. *The* consistency relation $R_{\tilde{t}}$ *of syncx* $\tilde{t}$ *is given by:*

$$R_{\tilde{t}} = \left\{ (a, b) \mid \tilde{t}(a, b) = (a, b) \right\}.$$
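As a minimal illustration, Definitions 1 and 2 can be sketched in Python. The integer models and the transformation `sync_max` are assumptions made for this sketch only; they are not part of the formalisation.

```python
from typing import Callable, Tuple

Model = int  # assumption: integers stand in for models
Syncx = Callable[[Model, Model], Tuple[Model, Model]]


def sync_max(a: Model, b: Model) -> Tuple[Model, Model]:
    """Hypothetical syncx: both models take the larger of the two values."""
    m = max(a, b)
    return (m, m)


def image_is_fixed(t: Syncx, a: Model, b: Model) -> bool:
    """Definition 1, checked for one input pair: t(t(a, b)) == t(a, b)."""
    once = t(a, b)
    return t(*once) == once


def in_consistency_relation(t: Syncx, a: Model, b: Model) -> bool:
    """Definition 2: (a, b) is in R_t if, and only if, t leaves (a, b) unchanged."""
    return t(a, b) == (a, b)
```

The consistency relation is thus never stored explicitly; it is recovered by asking whether the syncx would change the pair of models.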

This paper focuses on transformation networks that are created when combining multiple syncx:

Definition 3. *A* transformation network $N := ((V,E), T)$ *consists of a directed, connected, self-loop-free graph* $G = (V,E)$ *and a syncx assignment* $T : E \to \mathbb{T}$*. Any two vertices* $\{a, b\} \subseteq V$ *have at most one edge between them:* $(a, b) \in E \implies (b, a) \notin E$*. The universe of all model transformation networks for* $\mathbb{M}$ *is called* $\mathbb{U}$*.*

A transformation network captures the topology and the used transformations. There is no inherent reason to exclude multigraphs or self-loops. We use this simpler definition because it makes it easier to argue about the networks without restricting expressiveness. We use directed edges instead of undirected ones to provide a notion of the "left" and "right" model for a syncx. The edges' direction does not indicate anything about the direction of change propagation. We will usually regard the network as given and try to find suitable model assignments:

Definition 4. *For a transformation network* $N := ((V,E), T)$*, a* model assignment $M$ *is a function* $M : V \to \mathbb{M}$*.*

Naturally, we are particularly interested in model assignments that are consistent with the transformations:

Definition 5. *For a transformation network* $N := ((V,E), T)$*, a model assignment* $M$ *is* consistent *if, and only if*

$$\forall (a, b) \in E : (M(a), M(b)) \in R_{T(a,b)}$$

*The set of all consistent model assignments for* $N$ *is called* $R_N$*.*
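Definitions 3 to 5 can be sketched in the same way. The vertex names are borrowed from the motivating example, while `sync_max`, the integer models and the decision to reject empty graphs are assumptions of this sketch.

```python
from typing import Callable, Dict, Hashable, Set, Tuple

Model = int  # assumption: integers stand in for models
Vertex = Hashable
Syncx = Callable[[Model, Model], Tuple[Model, Model]]
Assignment = Dict[Tuple[Vertex, Vertex], Syncx]  # the syncx assignment T


def sync_max(a: Model, b: Model) -> Tuple[Model, Model]:
    """Hypothetical syncx: both models take the larger of the two values."""
    m = max(a, b)
    return (m, m)


def is_valid_network(vertices: Set[Vertex], T: Assignment) -> bool:
    """Structural conditions of Definition 3 (empty graphs are rejected here)."""
    edges = set(T)
    if not vertices or any(a not in vertices or b not in vertices for a, b in edges):
        return False
    if any(a == b or (b, a) in edges for a, b in edges):
        return False  # self-loop, or two edges between the same pair of vertices
    # Connectivity check that ignores the direction of the edges.
    seen, stack = set(), [next(iter(vertices))]
    while stack:
        v = stack.pop()
        if v in seen:
            continue
        seen.add(v)
        stack.extend(b if a == v else a for a, b in edges if v in (a, b))
    return seen == set(vertices)


def is_consistent(T: Assignment, M: Dict[Vertex, Model]) -> bool:
    """Definition 5: every syncx leaves the two models at its edge unchanged."""
    return all(t(M[a], M[b]) == (M[a], M[b]) for (a, b), t in T.items())


vertices = {"UML", "Java", "OpenAPI"}
T = {("UML", "Java"): sync_max, ("OpenAPI", "Java"): sync_max}
```

Note that the direction of an edge only fixes which model is passed as the left and which as the right argument of the syncx.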

We use the following additional notation in this paper:


– "Im(f)" to denote the image of a function f

#### 2.4 Problem Description

Our goal is to find an algorithm that, given a transformation network $N := ((V,E), T) \in \mathbb{U}$ and a model assignment $M$, finds a consistent model assignment by applying transformations in Im(T). We call such an algorithm a "(transformation network) execution strategy". It is "universal" if it is parametrised by, and thus defined for, every network.

Definition 6. *A universal execution strategy determines an order (i.e., a permutation with duplicates) of transformations in* $\mathrm{Im}(T)$ *for a given transformation network* $N := ((V,E), T) \in \mathbb{U}$ *and model assignment* $M \in (V \to \mathbb{M})$*. It realises a partial function* $S : \mathbb{U} \times (V \to \mathbb{M}) \rightharpoonup (V \to \mathbb{M})$*.*

An execution strategy finds a new model assignment only by executing the transformations of the network, as more precisely defined by Klare et al. [15, Definition 8]. If S(N,M) ≠ ⊥, we say that the strategy "resolves" N and M. If S(N,M) = ⊥, we say that the strategy fails. We have further requirements:

Requirement 1. *An execution strategy must be correct:*

$$\forall N \coloneqq ((V, E), T) \in \mathbb{U} \; \forall M \in (V \to \mathbb{M}) : S(N, M) \in R_N \cup \{\perp\}$$

Requirement 2. *An execution strategy must be hippocratic:*

$$\forall N \coloneqq ((V, E), T) \in \mathbb{U} \; \forall M_c \in R_N : S(N, M_c) = M_c$$

An execution strategy will not always be able to find a consistent new model assignment (i.e., there will be some N,M such that S(N,M) = ⊥). First, there may not be a consistent model assignment at all (i.e., $R_N = \emptyset$). Second, there may be a consistent model assignment but no execution order of the transformations that yields that assignment [30, 16]. We call such inputs "unresolvable" [30]. Conversely, if there is an execution order of the transformations that yields a consistent model assignment, we call the inputs "resolvable".

An execution strategy may even fail for resolvable inputs: The execution strategy may not "find" a consistent model assignment, even though it is reachable. For example, the strategy may abort before having executed the transformations often enough, or finding the assignment might require an order of execution which the strategy does not consider. We call such a strategy "conservative":

Definition 7. *An execution strategy* S *is* conservative *if it is correct and if there can be resolvable inputs* N,M *with* S(N,M) = ⊥*.*

The higher the probability that an execution strategy yields a result for resolvable inputs (we also say the lower its "level of conservativeness"), the more useful the strategy will be. It is, however, also desirable that the strategy is predictable, meaning that one can determine beforehand for which inputs the strategy will succeed. For example, it would be useful to know whether a strategy yields a result for a given network for *any* resolvable model assignment. Informally speaking, we would like to have an "easy-to-check" criterion for transformation networks determining whether this is the case. An even better criterion could be applied to a single syncx, such that the strategy can resolve all inputs with a network of syncx that fulfil the criterion. This would be ideal for the motivated context of independently developing and freely combining syncx to a network.

To summarise, we aim to find a correct, hippocratic execution strategy that is able to keep models consistent via transformation networks. The strategy should succeed for realistic inputs with a high probability. Additionally, we aim to find criteria that determine the cases in which the strategy will succeed.

# 3 Related Work

Approaches for restoring model consistency have been subject to intensive research, surveyed by Macedo et al. [21]. Model transformations are a well-researched option, and several tools and languages have been developed to support them [27, 18, 25]. Research has, however, mainly focused on consistency between two models, which also concerns theoretical properties like *termination* as one of the properties that we investigate for the execution of transformation networks [7]. Maintaining consistency between more than two models has recently gained more attention, especially in terms of a dedicated Dagstuhl seminar [2]. The central approaches of multiary transformations and networks of binary transformations can be distinguished. In Section 1, we have discussed that multiary transformations are complex to specify, whereas networks of binary transformations have limited expressiveness [30], which does, however, not seem to be practically relevant [2].

*Multiary Transformations:* Different approaches for multiary transformations have been proposed. QVT-R [22] supports multidirectionality already by design, but ambiguities in the standard limit practical applicability [20]. Triple Graph Grammars (TGGs) [26] are bidirectional specifications, which are well-suited for model transformations [1]. Extensions of TGGs to multiple models called Multi Graph Grammars (MGGs) [17] and Graph Diagram Grammars [34, 33] consider the specification of multidirectional rules. All these approaches, however, require the transformation developer to know about and be able to express the relations between all involved models, which we reasonably excluded by assumption.

*Auxiliary Models:* Not all multiary relations can be expressed by sets of binary ones. Adding one auxiliary model makes it, however, theoretically possible to express arbitrary multiary relations by binary ones [30]. Some work discussed which kinds of relations can be expressed with such an approach and how they can be formalised in the lenses framework [5, 31]. Other work discussed how composing such auxiliary models to express commonalities of models can be achieved [14]. Such auxiliary models actually encode a multiary transformation in a model together with binary transformations to the models to keep consistent, resulting in the same challenges as for transformation networks. In consequence, our work on transformation networks is also required and applicable there.

*Binary Transformations:* Although they cannot express all multiary relations, there are arguments in favour of using networks of modular transformations, especially binary ones: They are easier to develop when domain knowledge is distributed [13] and they are easier to comprehend by a single developer [2, 30]. Additionally, binary transformations are researched well and a variety of tools supporting different kinds of specifying them exist [27, 18, 25, 21]. Most formalisms and tools consider *bidirectional* transformations, whereas networks require synchronising transformations, as motivated in Section 2.2. Non-synchronising transformations can, however, be adapted to become synchronising [37].

*Transformation Chains:* Transformation chains combine transformations to derive low-level models from high-level ones across intermediate representations. Languages like FTG+PM [19] and UniTI [35] enable the specification of such chains. Transformation chains are, however, only a special case of general transformation networks. Etien et al. consider specific properties of transformation chains. They investigate how conflicts in terms of results depending on the execution order can be detected [8]. These results do, however, not aim to relieve developers from the task of finding an execution order manually, as we do in this paper.

*Transformation Composition:* Transformation composition techniques are a means to build networks of binary transformations. They can be separated into internal, white-box approaches [36], and external techniques, which consider transformations as black-boxes. Our contributions can be seen as an external composition technique. However, composition usually considers transformations between the same rather than different types of models. From a theoretical perspective (see Section 2.3) this could be treated equally by not distinguishing models by their metamodels. Practical approaches, however, consider transformations between specific metamodels rather than arbitrary models.

Fig. 2. Example yielding inconsistent models after executing each transformation once. Numbers in italics indicate the order in which changes are performed.

*Execution Strategies:* Di Rocco et al. [3] describe a simple strategy for orchestrating transformations, but make strong assumptions requiring that each of them is only applied once. Stevens [30] proposes a strategy that also executes each transformation only once in one direction. It includes a notion of authoritative models, which are not allowed to be changed, and does not consider synchronising transformations. Likewise, Stevens [29] proposes to find an *orientation model* defining in which direction transformations are executed. If, however, several transformations modify the same model, the approach leaves it to the developer to determine an execution order after which all consistency relations hold. Such strategies are only correct if the network is a tree, or if no transformations interfere with each other. We present a simple scenario in which this is already too limiting in Section 4.1. We overcome this limitation by executing transformations more than once and thereby letting them "negotiate" a result even if they interfere, which yields a *universal* execution strategy for arbitrary network topologies.

# 4 Design Space

We approach the possibilities for designing an execution strategy by looking at how often it executes syncx in the worst case. We consider the two extremes of executing every syncx at most once and executing them an unlimited number of times, and find that neither of them will do: While the first one is too limiting, the second one cannot guarantee termination. As a consequential insight, a universal execution strategy needs to be *conservative*, introduced as contribution *C2* .

#### 4.1 One Execution per Transformation

Several proposed strategies execute every transformation in a network at most once [30, 35]. Since we expect that transformations are developed independently, and are thus not necessarily aligned (see Section 2.2), restricting the number of executions to one per transformation would, however, limit the possible combinations of them, and models could not be kept consistent in desirable scenarios. We give an example for this in the following.

Fig. 3. A transformation network with *n* transformations reacting to each other.

We use the example of Section 2.1, and focus on the UML, Java and OpenAPI models to consider the scenario visualised in Figure 2: An architect creates a new UML interface and applies an execution strategy that executes every transformation once. First, the UML-to-Java syncx creates an appropriate interface in Java. The OpenAPI-to-Java syncx recognises that the interface should be exposed via an HTTP API and creates a matching endpoint in the OpenAPI model. Additionally, it creates a stub implementation with parsing and serialisation code in Java. The stub implementation classes can, however, not be propagated back to UML, because the UML-to-Java syncx has already been executed.

We see that if we limit the number of executions to one per transformation, transformations cannot propagate back the changes that other transformations have made. However, in the context described in Section 2.2, it is necessary that transformations are able to "react" to the changes made by other transformations. This offers, for instance, separation of concerns: The logic for a certain aspect of consistency can be put in only one transformation and other transformations will propagate it throughout the network. Without such a mechanism, all aspects of consistency would need to be implemented in all transformations. This would cause duplication of logic and reduce reusability of transformations, which would be impractical and contradicts our assumption of independent development. If we added the logic for creating implementations of relevant Java interfaces to the UML-to-Java syncx, then it would implicitly assume the presence of the Java-to-OpenAPI syncx. It could, thus, not be easily reused in networks where the Java-to-OpenAPI syncx is not used.

We can generalise the previous example: Let the model universe be the natural numbers: $\mathbb{M} = \mathbb{N}_0$. Let further for any $1 \le j \le n$ the syncx $\tilde{i}_j$ be defined as

$$\tilde{i}_j \colon (a,b) \mapsto \begin{cases} (m+1, m+1) & \text{if } m=j\\ (m,m) & \text{else} \end{cases} \qquad \text{with } m := \max\{a,b\}$$

$\tilde{i}_j$ sets both models to the higher number of the two, except if that number is $j$. Then $\tilde{i}_j$ increments the result by one. This is an abstraction of syncx "reacting" to each other: The $\tilde{i}_j$ seek to set all models to the same value, except that after $\tilde{i}_{j-1}$ was executed, $\tilde{i}_j$ changes its behaviour and increments the value by one.

We now construct the transformation network $N_n$ for $n = 2k$, $k \in \mathbb{N}^+$ (see Figure 3) with $n$ indicating the number of syncx within the network, and examine how many executions it requires:

$$\begin{aligned} T_n &= (i, i+1) \mapsto \begin{cases} \tilde{i}_{2i} & \text{if } i \le \frac{n}{2} \\ \tilde{i}_{2i-n-1} & \text{else} \end{cases} \\ N_n &= (([1, n+1], \{(i, i+1) \mid i \in [1, n]\}), T_n) \end{aligned}$$

Lemma 1. $\tilde{i}_n$ *must be executed at least* $n$ *times to resolve* $N_n$ *with the initial model assignment*

$$M_1 \colon i \mapsto \begin{cases} 1 & \text{if } i = 1 \\ 0 & \text{else} \end{cases}$$

*Proof.* The only reachable model assignment that is consistent is $M_{n+1} \colon i \mapsto n+1$, because $\tilde{i}_n$ still increments the value $n$. It is reached by having every $\tilde{i}_j$ increment the highest number in the model assignment by one if that highest number currently is $j$. All transformations incrementing even numbers are on one side of $\tilde{i}_n$ (except for $\tilde{i}_n$ itself), all transformations incrementing odd numbers are on the other side. Thus, the currently highest number must be propagated to the other side of $\tilde{i}_n$ at least $n-1$ times. Additionally, $\tilde{i}_n$ must increment $n$ to $n+1$.

Theorem 1. *For any execution strategy that uses* O(1) *executions of each transformation, there are inputs that the execution strategy cannot resolve.*

*Proof.* Follows directly from Lemma 1.

The example network in Figure 2 is a simplification of a realistic transformation scenario, which we generalised to the network $N_n$. In consequence of Theorem 1, we can expect that transformation networks can, in general, not be resolved with O(1) executions of each transformation.
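The behaviour of the network $N_n$ can be reproduced with a small simulation. The following Python sketch implements the syncx $\tilde{i}_j$ and the assignment $T_n$ as defined above. The round-robin execution order and the counting of only those executions that change a model are assumptions of this sketch, not part of the construction.

```python
from typing import Callable, Dict, Tuple

Syncx = Callable[[int, int], Tuple[int, int]]


def make_i(j: int) -> Syncx:
    """The syncx i_j: set both models to the maximum, incremented if it equals j."""
    def i_j(a: int, b: int) -> Tuple[int, int]:
        m = max(a, b)
        return (m + 1, m + 1) if m == j else (m, m)
    return i_j


def syncx_index(i: int, n: int) -> int:
    """T_n: the edge (i, i+1) carries i_{2i} if i <= n/2, and i_{2i-n-1} otherwise."""
    return 2 * i if i <= n // 2 else 2 * i - n - 1


def simulate(n: int) -> Tuple[Dict[int, int], Dict[int, int]]:
    """Execute N_n from M_1 in round-robin order until no syncx changes a model.

    Returns the final model assignment and, for each index j, the number of
    executions of i_j that changed at least one model.
    """
    assert n >= 2 and n % 2 == 0, "the construction assumes n = 2k"
    models = {v: (1 if v == 1 else 0) for v in range(1, n + 2)}
    effective = {j: 0 for j in range(1, n + 1)}
    changed = True
    while changed:
        changed = False
        for i in range(1, n + 1):
            j = syncx_index(i, n)
            before = (models[i], models[i + 1])
            after = make_i(j)(*before)
            if after != before:
                models[i], models[i + 1] = after
                effective[j] += 1
                changed = True
    return models, effective


final_models, effective_runs = simulate(4)
```

For $n = 4$, the run ends with all five models equal to 5, and $\tilde{i}_4$ changes the models in four of its executions, which matches the bound of Lemma 1.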

#### 4.2 Unlimited Executions

We now consider an execution strategy that executes transformations as long as they still change models, and terminates once no more changes occur. This overcomes the shortcoming that we observed with limiting the number of executions to a constant; we will, however, see that we cannot guarantee termination of such an execution strategy. By simulating Turing machines with transformation networks, we prove that it is undecidable whether the strategy will terminate.

Given a Turing machine TM over some alphabet $\Sigma$, we construct a transformation network $N_{\mathrm{TM}} := ((V,E), T_{\mathrm{TM}})$ and a model assignment $M_{\mathrm{TM},x}$ that are resolvable if, and only if, TM halts on input $x \in \Sigma^*$. We assume that TM contains no self-loops as well as no cycles of length 2, i.e., that each transition and each sequence of two transitions changes the state of TM. This is without loss of generality, since duplication and triplication of each state resolves such self-loops and cycles, respectively. The constructed models consist of a timestamp, the tape content and the tape position (i.e., $\mathbb{M} = \mathbb{N}_0 \times \Sigma^* \times \mathbb{N}_0$). The network $N_{\mathrm{TM}}$ has TM's states as vertices and exactly one directed edge (in arbitrary direction) between each pair of states having a transition between them. The transformations increment the timestamp, change the tape content and update the tape position according to TM's transition if, and only if, the source model's timestamp is higher than the target model's timestamp. More formally, let $\operatorname{Tr}(a, b) \subseteq \Sigma \times \{-1, 0, 1\} \times \Sigma$ be the transitions defined between the states $a$ and $b$ (with $-1$, $0$ and $1$ indicating the head movements "left", "stay" and "right"). We define $T_{\mathrm{TM}}$ with $w|_{p \leftarrow r} := w[0\,..\,p-1] \cdot r \cdot w[p+1\,..\,|w|-1]$ such that:

$$\forall (a,b) \in E: T_{\operatorname{TM}}(a,b)(\alpha = (t_a, w_a, p_a), \beta = (t_b, w_b, p_b))$$

$$= \begin{cases} (\alpha, (t_a+1, w_a|_{p_a \leftarrow r}, p_a+d)) & \text{if } t_a > t_b \land \exists \left(w_a[p_a], d, r\right) \in \operatorname{Tr}(a,b) \\\\ ((t_b+1, w_b|_{p_b \leftarrow r}, p_b+d), \beta) & \text{if } t_a < t_b \land \exists \left(w_b[p_b], d, r\right) \in \operatorname{Tr}(b,a) \\\\ (\alpha, \beta) & \text{else} \end{cases}$$

Let $s$ be the initial state of TM. We set

$$M_{\mathrm{TM},x} \colon v \mapsto \begin{cases} (1,x,0) & \text{if } v=s\\ (0,\varepsilon,0) & \text{else} \end{cases}$$
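To make the construction more tangible, the following Python sketch implements the case distinction above for a deterministic machine. Several details are assumptions of this sketch: the blank symbol `_` for cells beyond the stored tape, a head that never moves left of cell 0, the round limit `max_rounds`, and the tiny three-state example machine.

```python
from typing import Callable, Dict, Optional, Tuple

Model = Tuple[int, str, int]               # (timestamp, tape content, head position)
Transitions = Dict[str, Tuple[int, str]]   # read symbol -> (head move d, written symbol r)
Syncx = Callable[[Model, Model], Tuple[Model, Model]]

BLANK = "_"  # assumption: cells beyond the stored tape read as blank


def read(w: str, p: int) -> str:
    return w[p] if 0 <= p < len(w) else BLANK


def write(w: str, p: int, r: str) -> str:
    """w with position p replaced by r, padding with blanks first (assumes p >= 0)."""
    w = w.ljust(p + 1, BLANK)
    return w[:p] + r + w[p + 1:]


def make_syncx(tr_ab: Transitions, tr_ba: Transitions) -> Syncx:
    """Syncx for the edge (a, b), given the transitions a -> b and b -> a."""
    def t(alpha: Model, beta: Model) -> Tuple[Model, Model]:
        (ta, wa, pa), (tb, wb, pb) = alpha, beta
        if ta > tb and read(wa, pa) in tr_ab:
            d, r = tr_ab[read(wa, pa)]
            return alpha, (ta + 1, write(wa, pa, r), pa + d)
        if ta < tb and read(wb, pb) in tr_ba:
            d, r = tr_ba[read(wb, pb)]
            return (tb + 1, write(wb, pb, r), pb + d), beta
        return alpha, beta
    return t


def run(network: Dict[Tuple[str, str], Syncx], models: Dict[str, Model],
        max_rounds: int = 1000) -> Optional[Dict[str, Model]]:
    """Execute all syncx repeatedly until nothing changes, or give up."""
    models = dict(models)
    for _ in range(max_rounds):
        changed = False
        for (a, b), t in network.items():
            result = t(models[a], models[b])
            if result != (models[a], models[b]):
                models[a], models[b] = result
                changed = True
        if not changed:
            return models
    return None  # no fixed point within max_rounds


# Hypothetical machine: q0 and q1 each overwrite a "0" with "1" and move right; q2 halts.
step = {"0": (1, "1")}
network = {("q0", "q1"): make_syncx(step, {}), ("q1", "q2"): make_syncx(step, {})}
initial = {"q0": (1, "00", 0), "q1": (0, "", 0), "q2": (0, "", 0)}
final = run(network, initial)
```

In this example, the model with the highest timestamp in `final` belongs to the halting state `q2` and carries the tape `"11"`.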

Lemma 2. *Executing the transformations of* $N_{\mathrm{TM}}$*, with initial model assignment* $M_{\mathrm{TM},x}$*, until no transformations change the model assignment anymore terminates if, and only if,* TM *halts on input* $x$*. If executing the transformations terminates with the final model assignment* $M_f$*, then the model with the highest timestamp in* $\mathrm{Im}(M_f)$ *contains* TM$(x)$ *as tape content.*

*Proof.* We can see by induction over the model assignments $M_i$, $i \in \mathbb{N}_0$, created while executing the transformations:


Theorem 2. *Let* S *be an execution strategy that executes transformations until a consistent model assignment is reached. There are inputs for which it cannot be decided whether* S *will terminate.*

*Proof.* It follows from Lemma 2 that deciding whether S terminates could decide the halting problem for a universal Turing machine.

Even worse, this construction makes it unlikely that we will find a practicable criterion that ensures success of an execution strategy like we have motivated in Section 2.4. Because we want the criterion to apply to a single syncx, it would need to restrict the syncx so much that it makes building a network simulating Turing machines out of the syncx impossible. But since the definition of the syncx in $\mathrm{Im}(T_{\mathrm{TM}})$ is structurally simple, it seems unlikely that a syncx fulfilling the hypothetical criterion would still be apt for most practical use cases.

We could avoid undecidability if we restricted the models' size. The models could then no longer store an unbounded tape and, thus, only simulate space-restricted Turing machines. There is, however, no reasonable bound for a *necessary* model size, to which they could be limited. In consequence, determining a universal space bound for models would be an arbitrary and thus impractical restriction.

Finally, one could question whether it is relevant if an execution strategy can be guaranteed to terminate. Execution strategies will be used to tell users whether changes they made can be incorporated into the other models automatically. In consequence, users should reliably and timely get a response. We might compare this situation to merging changes in version control systems. There, users also want a reliable and timely response on whether their changes could be incorporated automatically, or whether they need to resolve conflicts manually.

# 5 Proposed Strategy

As a consequence of the previous findings, every universal execution strategy will be *conservative*: there will be inputs for which it fails, even though there would have been an execution order leading to a consistent model assignment. In this section, we discuss how to find an appropriate execution order and bound, and finally present the "explanatory strategy", constituting contribution *C3* .

#### 5.1 Execution Order: Providing Explainability

Increasing the number of transformation executions that an execution strategy permits lowers its level of conservativeness. In contrast, the effects of different orders in which transformations can be executed are not as easy to categorise. The authors developed a model transformation network simulator [11], whose source code is available on GitHub [10]. It allows users to construct transformation networks and to define execution strategies, which can be applied step by step. All examples presented in this paper are also modelled in the simulator. For each examined systematic execution order, such as a depth-first or breadth-first selection, the authors found categories of networks on which the order performed worse than another one in terms of conservativeness. In consequence, conservativeness is not a good sole criterion to evaluate orders by.

We know that a universal execution strategy will inevitably be conservative, i.e., possibly fail for resolvable inputs. In practice, it will be important how well an execution strategy provides explainability in such cases, i.e., helps users to understand where and why the strategy failed with the selected execution order. The order plays a decisive role in this regard, which is why we focus on finding a strategy that improves the order. Imagine, for instance, that the strategy executed transformations in an arbitrary order until some limit is reached. Users might then be confronted with a situation where all transformations have been executed, but the last model assignment is only consistent with some of them. There would be no clear pattern and few clues for users where to start investigating the failure's cause. To improve explainability, the authors thus propose the following principle for an execution order:

Principle 1. *Ensure consistency among the transformations that have already been executed before executing a transformation that has not been executed yet.*

Since a syncx can change both models, executing it may result in models that are inconsistent with the syncx that have been executed previously. Following Principle 1, these inconsistencies should be addressed first. In effect, a strategy applying the principle will maintain a subnetwork of syncx with a consistent model assignment and try to expand the subnetwork transformation by transformation.

To exemplify how Principle 1 provides *explainability*, suppose that an execution strategy applying that principle fails after having executed the set of syncx $\mathbb{E} \subseteq \mathbb{T}$. Let $t \in \mathbb{E}$ be the last syncx that was executed for its first time. The strategy can then inform users that integrating $t$ into the subnetwork induced by $\mathbb{E}$ failed. Furthermore, it can inform users that a result that is consistent with the syncx in $\mathbb{E} \setminus \{t\}$ exists. By that, users gain valuable information for handling the error: First, when trying to understand the error, they can ignore any syncx that is not in $\mathbb{E}$. Second, some aspect of consistency that is present in the consistency relation realised by $t$, but absent in the consistency relations realised by the syncx in $\mathbb{E} \setminus \{t\}$, hinders the strategy from creating a consistent result. Third, when users try to find a consistent model assignment manually, they can start with the consistent result that exists for $\mathbb{E} \setminus \{t\}$ instead of having to start from scratch.

#### 5.2 Execution Bound: Reacting to Each Other

As we have seen, we need to restrict the number of transformation executions with a function in ω(m) (m being the number of syncx in the input network). Such a limit must be reasonable to support most practical use cases: Not allowing enough transformation executions reduces the usefulness of the strategy since not all useful networks can be resolved. Allowing too many executions might make the strategy run for a long time before aborting, without adding much value.

In Section 4.1, we have motivated that syncx should be able to "react" to each other. We have seen that this excludes any bound in O(1) for the number of executions per transformation, but to guarantee termination we can also not allow transformations to react to each other indefinitely. If a syncx t changes the models and the other already executed syncx have reacted to those changes by adapting the models to be consistent with them as well, t should not react by changing the models again. Because if t changed the models again, this could easily result in executing the same sequences of transformations repeatedly and there would likely be no consistent result.

We call transformations that behave in the described way *N-converging*. This is not a property of a syncx on its own but relative to its network N. Thus, it cannot be achieved just by proper construction of an individual transformation. There is, unfortunately, also no simple way to check it statically. Nevertheless, it captures the sensible expectation for transformations explained above. We obtain an execution bound for a strategy by only requiring it not to fail if all syncx are N-converging. We will see how this execution bound behaves in combination with Principle 1 in the subsequently presented execution strategy.

Definition 8. *Let* $N := (G, T)$ *be a transformation network. A syncx* $t \in \mathrm{Im}(T)$ *is* N-converging *if for every initial model assignment and each subset of the syncx* $T_p \subseteq \mathrm{Im}(T)$ *with* $t \in T_p$ *the resulting model assignment is consistent with* $t$ *whenever* $t$ *has been executed after a sequence of the syncx in* $T_p$ *that contains each permutation of those syncx as a (not necessarily contiguous) subsequence.*

We only require that the sequence of transformation executions contains each permutation, but allow other executions in between. As an example, assume a network N of N-converging syncx t<sub>1</sub>, t<sub>2</sub> and t<sub>3</sub>. After executing them in the order t<sub>1</sub> t<sub>2</sub> t<sub>3</sub> t<sub>1</sub> t<sub>2</sub> t<sub>3</sub>, the current model assignment may still be inconsistent with t<sub>1</sub> because t<sub>1</sub> was not executed after the order t<sub>3</sub> t<sub>2</sub>. After executing t<sub>1</sub> once more, the resulting model assignment must now be consistent with all syncx: t<sub>1</sub> was executed after the two orders of other syncx t<sub>2</sub> t<sub>3</sub> and t<sub>3</sub> t<sub>2</sub>. Likewise, t<sub>2</sub> was executed after t<sub>1</sub> t<sub>3</sub> and t<sub>3</sub> t<sub>1</sub>, and t<sub>3</sub> was executed after t<sub>1</sub> t<sub>2</sub> and t<sub>2</sub> t<sub>1</sub>.
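The condition in this example can be checked mechanically. The following Python sketch is an illustration only: the function names `is_subsequence` and `executed_after_all_orders` and the string labels for the syncx are assumptions made here, not part of the formalisation. Following the example, it tests whether a syncx t has been executed after every order of the other syncx.

```python
from itertools import permutations


def is_subsequence(pattern, sequence):
    """True if pattern occurs in sequence in order, gaps allowed."""
    remaining = iter(sequence)
    # 'x in remaining' consumes the iterator up to the first match.
    return all(x in remaining for x in pattern)


def executed_after_all_orders(t, others, sequence):
    """True if, for every order of `others`, the sequence contains
    that order followed by an execution of t."""
    return all(
        is_subsequence(order + (t,), sequence)
        for order in permutations(others)
    )


executions = ["t1", "t2", "t3", "t1", "t2", "t3"]
print(executed_after_all_orders("t1", ["t2", "t3"], executions))           # False
print(executed_after_all_orders("t1", ["t2", "t3"], executions + ["t1"]))  # True
```

The first call fails because no execution of t1 follows the order t3 t2; appending one more execution of t1 closes that gap.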

#### 5.3 The Explanatory Strategy

We now turn to a concrete strategy that realises the discussed design choices. Algorithm 1 gives pseudocode for such a strategy, which we call the "explanatory strategy". At a high level, it acts like this: Given a changed model assignment, the strategy picks the next candidate syncx to execute. After executing the candidate, the strategy calls itself on the subnetwork formed by the already executed syncx. By that, it propagates the changes of the last execution throughout the subnetwork and ensures that they are consistent with the executed syncx. Finally, the strategy executes the initial candidate again to ensure that the changes added during the subnetwork propagation are consistent with the candidate. If that repeated execution of the candidate generates new changes in any model that is kept consistent by an already executed syncx, the execution fails, because the candidate does not fulfil the definition of being N*-converging*, as we will see in the following. In that case, the procedure returns the already executed syncx, together with the changes that restored consistency with them, to support a user in examining why the strategy failed. If the models are consistent with the candidate, the strategy picks the next one. In effect, the strategy realises Principle 1 in a recursive fashion and ensures that each permutation of all yet executed syncx is executed at every recursion level.

Fig. 4. Exemplary execution of the explanatory strategy for a change in the topmost model, depicting the iterations (horizontal) and recursion steps (vertical).

Figure 4 depicts an exemplary execution of the strategy for a network with four models and four transformations. We assume that after an initially consistent state of the models, the topmost one was modified. We can see that each recursion only treats the subnetwork of previously executed transformations. Hence, the network gets smaller at each recursion level.

Unlike the formalisation in Section 2.3, the presented algorithm is based on changes instead of model states. Changes contain information that cannot be recovered by comparing model states [6]. Thus in practice, we want to support change-based execution. The algorithm also uses changes to determine potential candidates for the next transformation to execute: It only picks candidates that are adjacent to a model that was changed. The input changes describe all changes that occurred since the last model assignment M that was known to be consistent. The procedure returns accumulatedChanges that, when applied to M, yield a new model assignment M′. For our formalisation, M′ is the algorithm's output.

We discuss some implementation details for the explanatory strategy further below. First, we prove that the strategy indeed has the motivated properties. We show that it always terminates and determine its execution bound.

Theorem 3. *The explanatory strategy terminates for every input.*

*Proof.* Because all called functions terminate, only the loop (Line 5) and the recursive call in Line 8 can lead to non-termination. Let m denote the number of edges of network. The set executed is initialised to be empty (Line 2) and grows by one element in every iteration of the loop. The loop is executed no more than m times, because after m iterations there is no transformation that is not in executed and, thus, the loop condition cannot be fulfilled.

The recursive call receives a network that is smaller than network in terms of edges, because it does not contain the current candidate. If network is empty, then the algorithm will not enter the loop and not make a recursive call. Hence, the recursive stack never gets higher than m.

Theorem 4. *The explanatory strategy executes syncx at most* O(2<sup>m</sup>) *times.*

*Proof.* Let T(m) denote the number of syncx executions the algorithm invokes for a network with m edges. The set executed is initialised to be empty and grows by one syncx every loop iteration (Line 13). It follows that the recursive call in Line 8 receives a network that is one syncx larger each time. Thus, we find

$$T(0) = 0,\quad T(m) = 2m + \sum_{i=0}^{m-1} T(i) = 2 + 2\,T(m-1) = 2\left(2^m - 1\right) \in \mathcal{O}(2^m) \qquad \square$$
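The step from the sum to 2 + 2 T(m − 1) follows by subtracting the recurrence for m − 1 from the one for m. As a sanity check, the Python sketch below counts executions in a loop shaped like the one described in the proof: each iteration executes the candidate, recurses on the already executed syncx, and executes the candidate again. The function `count_executions` is hypothetical and models only the worst case in which every syncx becomes a candidate and no failure occurs; it is not Algorithm 1 itself.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_executions(m):
    """Worst-case number of syncx executions for a network with m edges."""
    total = 0
    for already_executed in range(m):
        total += 1                                   # execute the candidate
        total += count_executions(already_executed)  # recurse on executed syncx
        total += 1                                   # execute the candidate again
    return total


# Compare the simulated count with the closed form 2 * (2^m - 1).
for m in range(6):
    print(m, count_executions(m), 2 * (2 ** m - 1))
```

Under these assumptions, the simulated count and the closed form agree for every m that is checked.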

Next, we show that the strategy fulfils the fundamental Requirements 1 and 2 regarding correctness and hippocraticness, which we defined in Section 2.4.

Theorem 5. *The explanatory strategy is correct.*

*Proof.* Assume the contrary, i.e., that the strategy produces a model assignment M for network N such that M ∉ R<sub>N</sub>. That means that there is an edge (a, b) ∈ E such that (M(a), M(b)) ∉ R<sub>t</sub>, where t := T(a, b). We distinguish these cases:


All cases lead to a contradiction.

Theorem 6. *The explanatory strategy is hippocratic.*

*Proof.* The strategy only produces changes by executing syncx, which, per definition, only generate changes if the models are not in their consistency relations.

Finally, we verify that we have indeed realised Principle 1 and that the strategy does not fail for a network N of only N-converging transformations.

Theorem 7. *The explanatory strategy ensures consistency among the transformations that have already been executed before executing a transformation that has not been executed yet (see Principle 1).*

*Proof.* After the recursive call in Line 8, the current model assignment is consistent with all executed syncx (Theorem 5) and no changes to models adjacent to an executed syncx are allowed.

Theorem 8. *If the input* network *of the explanatory strategy consists only of* network*-converging syncx, then the explanatory strategy does not fail.*

*Proof.* First, we note that when calling the algorithm on a network with m transformations, the first m − 1 iterations of the loop act identically to executing the algorithm on a network without the last candidate. Second, we note that the second part of the loop condition, "accumulatedChanges.*adjacentTo*(candidate)" (Line 5), does not change the algorithm's result apart from controlling the order in which the syncx are executed. If any syncx was never executed because of this condition, then executing it would not have changed any model. Hence, we assume w.l.o.g. that all syncx in network will get executed.

Now we show the following, stronger statement by induction over the number m of edges in network: "After running the explanatory strategy, the sequence of executed syncx contains each permutation of those syncx (not necessarily continuously)". Since the transformations are network-converging and because of our first note above, proving this statement shows that the condition leading to a failure (Line 10) will never evaluate to true. The statement is trivially true for m = 1. Assume that the statement is true for all networks of size 1 ≤ n < m but not true for a network of size m. That means that after executing the last iteration of the loop, there is an order o of the m syncx in network in which they have not been executed yet. Let t be the candidate of the last iteration. Let j be the index of t in o. Per induction assumption, the order o[1] ... o[j−1] has been executed in the previous iterations of the loop. Afterwards, t was executed in Line 6. Per induction assumption, the order o[j+1] ... o[m] has been executed in the recursive call (Line 8) of the last iteration. This happened after Line 6. Hence, the transformations have been executed in the order o. This is a contradiction.

The explanatory strategy only guarantees to produce a consistent model assignment if all syncx are N-converging. We can, unfortunately, not provide an approach to achieve N-convergence by construction or to determine N-convergence. We have, however, also discussed that every universal execution strategy needs to operate conservatively and thus fails in certain cases. Thus, even if a network N contains syncx that are not N-converging, the explanatory strategy still operates conservatively and at least fails based on the notion of a sensible and well-defined property. In addition, the exponential worst-case performance of the strategy is no limitation, because it only represents a bound to ensure termination. In cases in which the strategy terminates, we expect the repeated execution of each syncx to perform only a few changes in reaction to the changes made by other syncx, as otherwise they are unlikely to be N-converging. The interested reader can try out the explanatory strategy using the previously mentioned simulator [11].

In its current formulation, the explanatory strategy does not prevent the syncx from overwriting the initial user changes. This seems inappropriate, as user changes should usually not be reverted. Other authors address this issue by forbidding changes to models that have been edited by users [3, 30, 29], called "authoritative models". There are, however, practical use cases where such changes should be allowed—the example in Section 4.1 is one of them. An option would be to let the strategy fail as soon as a syncx execution overwrites a user change.

# 6 Conclusion

In this paper, we have discussed influencing factors for designing a universal execution strategy for model transformation networks. Such a strategy orchestrates transformations to create a consistent set of models. It involves determining an order to execute the transformations in, and a bound for the number of executions. We have proven that every universal execution strategy that always terminates needs to be conservative, i.e., it will fail for certain cases in which an execution order of transformations that yields a consistent solution exists. We have argued that providing explainability in cases where an execution strategy fails should be a central design goal. As a result, we have proposed the *explanatory strategy*, which is proven correct and terminates for every input. Additionally, it improves explainability of failures and has a well-defined bound for the number of transformation executions to ensure a reasonable level of conservativeness.

We have formalised our findings on execution bounds and the behaviour of the proposed execution strategy to prove the insights and expected properties of the strategy. In consequence, this paper provides fundamental knowledge about the design space and relevant design goals of transformation network execution strategies. While the statements on correctness and well-definedness are proven, those on the usefulness of the strategy were derived by argumentation. To improve evidence of the results, the authors plan to apply the strategy to realistic use cases, involving larger networks of more complex transformations.

Furthermore, the authors want to examine how the strategy can be further optimised: It might, e.g., be improved by backtracking and trying further candidate transformations, or by selecting the next candidate more carefully. Since early executed transformations will be executed most often, starting with those that are least likely to cause conflicts might be beneficial. Finally, this paper assumes transformations to be binary. Since the presented strategy does not require this, future research could investigate transferability to multiary transformations.

### References


#### Image Sources

paintingred: "Default Avatar Headshot Icons", found on Vecteezy. https://www.vecteezy.com/vector-art/141712-default-avatar-headshot-icons. Vecteezy Free License.

Object Management Group: UML logo. https://www.uml.org/index.htm. Trademark.

Palladio logo. https://sdqweb.ipd.kit.edu/wiki/File:Palladio-Logo-stilisiert-vektor.pdf. Authorized use.

The Linux Foundation: OpenAPI™ logo. https://github.com/OAI/OpenAPI-Style-Guide/blob/master/graphics/vector/OpenAPI\_Logo\_Black.svg. Trademark.

Freepik: "Computer". https://www.flaticon.com/free-icon/computer\_1077701. Flaticon Basic License.

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **CoVEGI: Cooperative Verification via Externally Generated Invariants**

Jan Haltermann(-) and Heike Wehrheim

Department of Computer Science, Paderborn University, Paderborn, Germany jfh@mail.upb.de, wehrheim@upb.de

**Abstract.** Software verification has recently made enormous progress due to the development of novel verification methods and the speed-up of supporting technologies like SMT solving. To keep software verification tools up to date with these advances, tool developers keep on integrating newly designed methods into their tools, almost exclusively by re-implementing the method within their own framework. While this allows for a conceptual re-use of methods, it nevertheless requires novel implementations for every new technique.

In this paper, we employ *cooperative verification* in order to avoid reimplementation and enable usage of novel tools as black-box components in verification. Specifically, cooperation is employed for the core ingredient of software verification which is *invariant generation*. Finding an adequate loop invariant is key to the success of a verification run. Our framework named CoVEGI allows a master verification tool to delegate the task of invariant generation to one or several specialized helper invariant generators. Their results are then utilized within the verification run of the master verifier, allowing in particular for crosschecking the validity of the invariant. We experimentally evaluate our framework on an instance with two masters and three different invariant generators using a number of benchmarks from SV-COMP 2020. The experiments show that the use of CoVEGI can increase the number of correctly verified tasks without increasing the used resources.

**Keywords:** Cooperation, Software Verification, Invariant Generation

# **1 Introduction**

Recent years have seen major progress in software verification as for instance witnessed by the annual competition on software verification SV-COMP [2]. This success is on the one hand due to advances in SAT and SMT solving and on the other hand due to novel verification methods like interpolation in model checking [36], automata-based software verification [31] or property directed reachability [16]. Still, automatic verification remains a complex and error-prone task. In particular, it is often the case that one tool can verify a particular class of programs, but fails to verify other classes (or even gives incorrect answers), whereas it is the reverse situation for another tool. Moreover, to keep their tools up to date with novel techniques, tool developers keep on integrating them by re-implementation within their framework.

<sup>-</sup> This author was partially supported by the German Research Foundation (DFG) under contract WE2290/13-1.

© The Author(s) 2021

E. Guerra and M. Stoelinga (Eds.): FASE 2021, LNCS 12649, pp. 108–129, 2021. https://doi.org/10.1007/978-3-030-71500-7\_6

An approach for changing this unsatisfactory situation is *cooperative verification* (for an overview see [13]). Cooperative verification builds on the idea of letting tools (and thus techniques) cooperate on verification tasks, thereby leveraging the tools' individual strengths. In particular, cooperative verification aims at *black box* combinations of tools, using existing tools off-the-shelf without re-implementation. While this sounds like a natural idea, its realization poses a number of challenges, the major one being the *exchange* and *usage* of analysis information. For cooperation, tools are required to produce (partial) results which other tools can understand and employ in their verification run. With conditional model checking [7], the first proposal of an exchange *format* for verification results was made. A conditional model checker outputs its (potentially partial) result in the form of a *condition* which can be read by other conditional model checkers in order to complete the verification task. Since verification tools normally do not understand conditions, *reducers* [23,9] have been proposed to bring conditions back into a form understandable by verifiers, namely into (residual) programs describing the so far unverified program part. This allows the result of a conditional model checker to be made usable by arbitrary other verifiers. A second type of existing result usage is the *validation* of tools' results [4,34], similar to proof-carrying code [37]. Both of these types are sequential forms of cooperation: a first verifier starts and a second verifier continues, either by completing or by validating a first result.

In this paper, we propose CoVEGI, a cooperation framework which complements these existing approaches by a new type of cooperation. Conceptually, this framework (depicted in Figure 1) consists of a *master verifier* and a number of *helper invariant generators*. The master verifier has overall control over the verification process and can *delegate* tasks to helpers as well as *continue* its own verification process with (partial) results provided by helpers. The helpers run in parallel as black boxes without cooperation. The task to be delegated is an integral part of software verification, namely *invariant generation*. The framework allows cooperation via outsourcing the task of invariant generation, leveraging the strength of specialized invariant generation tools.

Like for other types of cooperation, the question of the exchange format for results comes up. Here, we have chosen *correctness witnesses* [3] for this purpose. Correctness witnesses are employed in witness validation and certify a verifier's result stating the correctness of a program. These witnesses are particularly well suited for our intended usage, because their format is standardized and a number of verifiers already produce correctness witnesses. To account for the incorporation of helper verifiers not producing witnesses, our framework also foresees the inclusion of *adapters* transforming invariants into correctness witnesses. We provide an implementation of two such adapters. Witnesses are then *injected* into the verification run of the master. For stating the task to be solved by invariant generators we furthermore require *mappers* transforming program and property to be proven into a task format understandable by the helper tools. Figure 1 depicts our framework for cooperative verification via externally generated invariants. The framework can be arbitrarily configured with different masters and helpers, provided that suitable adapters and mappers are given.

Fig. 1: Cooperative verification via externally generated invariants

We have implemented CoVEGI within the CPAchecker framework [10] and have employed different configurations of it as master verifier. As helpers we have chosen publicly available verification tools, some producing and one not producing witnesses. We have then experimentally evaluated 14 different combinations of master and helper on benchmarks of the annual competition of software verification SV-COMP [2]. The experiments show an improvement over the verification capabilities of the master tool, without incurring significant overhead. In some cases, the verification time is even decreased in cooperative verification.

Summarizing, we make the following contributions.


# **2 Fundamentals**

We aim at the cooperative verification of programs written in GNU C, focusing on the validation of safety properties. To be able to define safety properties, a formal representation of programs as well as their semantics is needed. Thus we briefly introduce the syntax and semantics of programs which we consider here.

<sup>1</sup> https://llvm.org/docs/LangRef.html

We follow the notation of Beyer et al. [6] describing programs as *control-flow automata* (CFAs). A CFA is basically a control-flow graph with edges annotated with program statements. More formally, a program is represented as a control-flow automaton C = (L, l<sub>0</sub>, G), consisting of a set of program locations L, an initial location l<sub>0</sub> ∈ L and the control-flow edges G, G ⊆ L × Op × L. The set Op contains all possible operations on integer variables<sup>2</sup> present in the program, namely conditions (as of conditionals and loops), assignments, method calls and return statements. Figure 2(a) shows a C-program taken from the SV-COMP benchmarks<sup>3</sup>, and Figure 2(b) its corresponding CFA. The program also contains a special *error* label, used for encoding the property to be verified. The verification task for this program is to show the non-reachability of the error label at location 9, i.e., for our example program the verifier has to prove that y equals n after the loop which is true (since n is unsigned).

For the semantics, we start by defining program states. Let *Var* denote the set of all integer variables occurring in programs, *BExp* the set of boolean expressions and *AExp* the set of arithmetic expressions over *Var*. Then a *state* σ of the program is a mapping from the variables to the integers, i.e., σ : *Var* → ℤ. We lift the mapping to also contain the evaluation of arithmetic and boolean expressions so that σ maps *AExp* to ℤ and *BExp* to 𝔹. A finite *program path* π is a sequence of *transitions* $\langle\sigma_0, l_0\rangle \xrightarrow{g_0} \langle\sigma_1, l_1\rangle \cdots \xrightarrow{g_{n-1}} \langle\sigma_n, l_n\rangle$, such that σ<sub>0</sub> assigns 0 to all variables, l<sub>n</sub> is a leaf in the CFA and (l<sub>i</sub>, g<sub>i</sub>, l<sub>i+1</sub>) ∈ G holds for each transition $\langle\sigma_i, l_i\rangle \xrightarrow{g_i} \langle\sigma_{i+1}, l_{i+1}\rangle$ in π. Infinite program paths are defined analogously. As for state changes in paths: If g<sub>i</sub> is a boolean expression, method call or return statement, then σ<sub>i</sub> = σ<sub>i+1</sub> holds. If g<sub>i</sub> is an assignment x = a, where a ∈ *AExp*, then σ<sub>i+1</sub> = σ<sub>i</sub>[x ↦ σ<sub>i</sub>(a)]. Finally, we denote all paths of a program represented by a CFA C by paths(C).

Here, we are interested in verifying safety properties of programs given as CFAs. For the purpose of this paper, we define a *safety property* P as a pair of a location ℓ ∈ L and a boolean condition ϕ ∈ *BExp*. There can be multiple safety properties required to hold in a program. For our example program of Figure 2 the property is (8, n = y). For the verifier this is encoded in the form

    8: if (!(n==y))
    9: Error: return 1;

A CFA (or program) C *violates a safety property* P = (ℓ, ϕ) when the program reaches location ℓ in a state which does not satisfy ϕ. More formally, P is violated by C if there is some path π ∈ paths(C), π = $\langle\sigma_0, l_0\rangle \xrightarrow{g_0} \langle\sigma_1, l_1\rangle \cdots \xrightarrow{g_{n-1}} \langle\sigma_n, l_n\rangle$, and some i, 0 ≤ i ≤ n, such that l<sub>i</sub> = ℓ and σ<sub>i</sub>(ϕ) = *false*.
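The definition of a violation translates directly into a check over a single path. In the Python sketch below, a path is a list of (state, location) pairs, a state is a dictionary from variable names to integers, and the property is a (location, predicate) pair. These encodings, the name `violates` and the concrete variable values are assumptions chosen for illustration; in particular, `bad_path` is made up and is not a path of the example program.

```python
def violates(path, prop):
    """True if the path reaches the property's location in a state
    that falsifies the property's condition."""
    location, condition = prop
    return any(loc == location and not condition(state) for state, loc in path)


# Property (8, n = y) from the running example.
prop = (8, lambda s: s["n"] == s["y"])

# Hypothetical paths, given as (state, location) pairs.
safe_path = [({"n": 2, "x": 0, "y": 2}, 4), ({"n": 2, "x": 0, "y": 2}, 8)]
bad_path = [({"n": 2, "x": 0, "y": 1}, 4), ({"n": 2, "x": 0, "y": 1}, 8)]

print(violates(safe_path, prop))  # False
print(violates(bad_path, prop))   # True
```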

<sup>2</sup> In our formalization, we use integer variables only; the implementation covers C programs.

<sup>3</sup> https://github.com/sosy-lab/sv-benchmarks

Fig. 2: An example program, its control flow automaton and one witness

Cooperatively verifying safety of programs is achieved in our framework via external (loop) invariant generation. Syntactically, a *loop invariant* is a boolean expression associated to a loop head. A loop invariant needs to hold (1) before the first loop execution and (2) after each loop execution. The expression n = x+y, for instance, is a loop invariant for the program in Figure 2(a), associated to the loop head at location 4. This loop invariant facilitates verification, because in conjunction with the negated loop condition and information about initial variable values it ensures n to be equal to y after the loop. Other valid loop invariants would be x ≥ 0 or n = 3 ⇒ y ≤ 5, none of which, however, helps in proving the safety property. In particular, the loop invariant *true* does not provide any information. Thus, we call it a *trivial invariant*.
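Since Figure 2(a) is not reproduced here, the Python sketch below assumes a loop of the following shape, which matches the description in the text: x starts at n, y starts at 0, and each iteration decrements x and increments y while x > 0. The function `check_invariant` is hypothetical. It tests the candidate invariant n = x + y before the first and after each loop execution; this is testing on sample inputs, not a proof.

```python
def check_invariant(n):
    """Run the assumed loop and record whether n == x + y holds at the loop head."""
    x, y = n, 0
    holds = [n == x + y]           # (1) before the first loop execution
    while x > 0:
        x -= 1
        y += 1
        holds.append(n == x + y)   # (2) after each loop execution
    return all(holds), x, y


for n in range(5):
    ok, x, y = check_invariant(n)
    # On exit, x > 0 is false; together with x >= 0 this gives x == 0,
    # so the invariant n == x + y yields n == y.
    print(n, ok, x, y)
```

The comment in the final loop spells out the argument from the text: the invariant, the negated loop condition and the non-negativity of x together give n = y at the loop exit.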

As stated before, we chose *witnesses* (more specifically, correctness witnesses) as exchange format during collective invariant generation. Formally, a witness is a finite state automaton in which transitions are labelled with so-called *source code guards* and states can be equipped with boolean expressions. When all these boolean expressions are either *true* or *false*, we call the witness *trivial*. Source code guards are of the form location,type where type can be then, else, enterFunc and enterLoopHead. The guard o/w (otherwise) is used if a source code line does not match the other guards present. Via these labels we can match transitions of the automaton with edges in the CFA. Syntactically, correctness witnesses are stored in an XML format and consist of two parts: (1) general information like the program associated with the witness, and (2) a GraphML representation of the witness automaton. More information and a formal specification of correctness witnesses can be found in [3].

In Figure 2(c), we see a correctness witness for our example program. State q<sub>3</sub> is reached by transitions labelled 3,enterLoopHead or 6,enterLoopHead and thus corresponds to the loop head at program location 4. Associated with this state is the invariant n = x + y.

# **3 Concept**

In this section, we introduce our novel concept of **Co**operative **V**erification via **E**xternally **G**enerated **I**nvariants (CoVEGI), shown in Figure 1. The framework contains two sorts of main components: Master verifiers (one) and helper invariant generators (several). Next, we state some requirements on and explain the functionality of these components as well as their cooperation.

#### **3.1 Components of the CoVEGI-Framework**

The most important component of the framework is the master verifier, which we build out of an existing verifier. The master is responsible for coordinating the verification process and can, if needed, request support from the second type of components, the helpers, in the form of invariants as described by correctness witnesses. Hence, the master is also steering the cooperation.

In the following, we explain the two sorts of main components in more detail:


We can neither expect existing verification tools which we wish to use as helpers to be able to work on CFAs, nor to understand the safety property or to produce witnesses. Hence, we foresee two further sorts of components in our framework:


Table 1: Overview of the configuration options available


#### **3.2 Cooperation within CoVEGI**

After having explained the individual components, we define their interaction in the framework. In this paper, we focus on the *parallel* execution of several helpers which implement complementary approaches so that we can leverage their individual strengths. Algorithm 1 describes the form of cooperation. It is steered by several user configurable options which fix aspects like time and resource limits of master and helpers. Table 1 summarizes the configuration options. We next describe them in detail.



**Timeouts** Finally, similar to the master, we can set a specific timeout for the helpers which fixes how long they are allowed to try to generate invariants. The timeout option is called timeoutH.

Next, we explain the CoVEGI algorithm shown in Algorithm 1 in detail. We assume that master and helpers run as threads and can be started and stopped. We furthermore employ methods wait for waiting until some condition is achieved and join for waiting for a specific thread to complete.

Initially, the master verifier is started without any helper invariant generators running in parallel (line 1), providing the opportunity to verify programs on its own. It runs standalone until it requests help (either due to not being able to solve the problem alone or due to hitting its timer) or until it computes a result which is subsequently returned (line 3). Afterwards all helpers are started in parallel (lines 5 and 6). They also run until they reach their timeout, a solution is found or they are stopped. Their solutions (invariants) are inserted into the witness set (line 9). Depending on option termAfterFirstInv, either all but the first finished helper are stopped or the algorithm waits until all helpers have either computed a solution or run into their timeout. If invariants (witnesses) have been computed, these are injected into the master (line 18). If the restartMaster option is set, the master needs to be stopped before injection and restarted afterwards. Then the master continues and completes its verification (without any further request for help) and the result is finally returned.

*Example 1.* To explain the framework's functionality, we demonstrate the CoVEGI algorithm on the example presented in Figure 2(a). Assume that we instantiate the framework with a master verifier and four helper invariant generators that are used in parallel<sup>4</sup>. Moreover, we configure the framework as follows: We set restartMaster to true, terminateAfterFirstInv to false, timerM to 50 seconds and timeoutH to 300 seconds.

Initially, the master verifier runs standalone and after 50 seconds runtime it requests help. The master runs in parallel with the four helper invariant generators being called. Let us assume that the first helper returns only trivial invariants (after 10s), the second one the invariant n ≥ y (after 50s), the third one the invariant n = x+y (after 100s) and the fourth the invariant n−x−y = 0 (after 500s). The trivial invariant is ignored (see check in line 8) and when the second helper returns a solution, the third and fourth helper are still not stopped, due to the chosen configuration. The algorithm waits until the third helper computes the invariant and the fourth (only being able to compute an invariant after 500s) hits the timeout after 300s. Then the master is stopped, the invariants n ≥ y and n = x + y are injected and the master is restarted. The master verifier can use both invariants and might now compute the correct result.
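The outcome of Example 1 can be replayed with a small simulation. The Python sketch below does not start any threads or tools; it takes the helpers' finish times and results as data and computes which invariants would be injected into the master. The function `collect_invariants`, its parameter names and the encoding of the trivial invariant as the string "true" are assumptions made here. Stopping at the first non-trivial invariant when the option is enabled is this sketch's reading of the text, not a statement about Algorithm 1.

```python
def collect_invariants(helpers, timeout_h, term_after_first_inv):
    """Return the non-trivial invariants that would be injected into the master.

    helpers: list of (finish_time_in_seconds, invariant) pairs.
    """
    injected = []
    for finish_time, invariant in sorted(helpers):
        if finish_time > timeout_h:
            continue                  # helper hits its timeout first
        if invariant == "true":
            continue                  # trivial invariants are ignored
        injected.append(invariant)
        if term_after_first_inv:
            break                     # remaining helpers are stopped
    return injected


# Helpers of Example 1: (finish time in seconds, computed invariant).
helpers = [(10, "true"), (50, "n >= y"), (100, "n == x + y"), (500, "n - x - y == 0")]

print(collect_invariants(helpers, timeout_h=300, term_after_first_inv=False))
# ['n >= y', 'n == x + y']
print(collect_invariants(helpers, timeout_h=300, term_after_first_inv=True))
# ['n >= y']
```

With the configuration of Example 1, the simulation yields the two invariants n ≥ y and n = x + y, as in the text; the fourth helper is cut off by the 300-second timeout.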

#### **3.3 Witness Injection**

As master verifiers need to offer witness injection, we explain a possible procedure for predicate abstraction and k-induction, which are the two techniques we use as masters during the evaluation. For both, the invariants are extracted from the witness and then added to the analysis information already computed by the master verifier. Both analyses store their analysis information in an *abstract reachability graph* (ARG). Broadly speaking, an ARG is a CFA equipped with predicates. More formally, an ARG is a finite state automaton, where nodes, called *abstract states*, consist among others of analysis information (i.e. predicates) and program locations. Two nodes within an ARG are connected if their program locations are connected within the CFA. Note that a program location may occur in multiple abstract states, e.g. when the analysis unrolls a loop. Hence, witness injection has to update all the abstract states for whose program location the witness contains an invariant.

**Predicate Abstraction.** We use a predicate abstraction technique [11], conducting predicate refinement using a CEGAR (counterexample-guided abstraction refinement) scheme [20] with lazy abstraction [33] and Craig interpolation [32].

<sup>4</sup> In [29] it is shown that using more than two helpers does not make sense in practice.

Fig. 3: Workflow of an adapter for a helper working on an IR

*Witness Injection:* The predicate abstraction maintains, for each abstract state, one set of available predicates (called *precision*) and one set of valid predicates. Witness injection is realized by extracting all predicates and the corresponding locations from the invariants. If these predicates contain conjunctions of clauses, these are furthermore split up and inserted individually. Splitting predicates increases the performance due to the fact that SMT solvers perform better on many small predicates than on few larger ones<sup>5</sup>. These predicates are added to the precision of abstract states corresponding to the locations specified in the witness. Thereby, the predicates are used during the next abstraction performed by the analysis. The abstraction function itself guarantees that only predicates from the candidate set being valid at the current location are used. Thus, invalid invariants are ignored. This procedure can also be used when restarting predicate abstraction, by adding the predicates from the witness to the initial precision of the abstract states corresponding to the locations specified in the witness (which is empty otherwise).
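The splitting and insertion step can be illustrated as follows. This is a sketch under simplifying assumptions: invariants are plain strings with `&&` as conjunction, locations are string labels, and the precision is a map from locations to sets of predicates. The helper names `split_conjuncts` and `inject` are hypothetical and do not correspond to CPAchecker's API.

```python
from collections import defaultdict


def split_conjuncts(invariant):
    """Split a top-level conjunction 'a && b && c' into its clauses."""
    parts, depth, current, i = [], 0, "", 0
    while i < len(invariant):
        ch = invariant[i]
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        # Only split on '&&' outside of any parentheses.
        if depth == 0 and invariant.startswith("&&", i):
            parts.append(current.strip())
            current = ""
            i += 2
            continue
        current += ch
        i += 1
    parts.append(current.strip())
    return [p for p in parts if p]


def inject(precision, witness):
    """Add the split predicates to the precision of the matching locations."""
    for location, invariant in witness.items():
        precision[location].update(split_conjuncts(invariant))
    return precision


precision = defaultdict(set)
precision["loop_head"].add("x >= 0")          # predicate found by the master
witness = {"loop_head": "n >= y && n == x + y"}
inject(precision, witness)
print(sorted(precision["loop_head"]))
```

The sketch only extends the candidate set; deciding which candidates actually hold at a location remains the task of the abstraction function, as described above.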

**k-Induction**. The basic idea of k-induction [25] is to generalize bounded model checking (BMC) [14] via induction. After proving k-bounded program executions safe using BMC, a generalization is aimed for. To this end, the analysis generates auxiliary invariants that are continuously refined using a CEGAR-based analysis [5]. These invariants are combined with the information generated by BMC and generalized to a safety proof by successfully conducting an induction step.

*Witness Injection:* For both cases, adding invariants to a running analysis or adding them before a restart, we make use of the same idea: Whenever a witness is made available to the analysis, the encoded predicates and their program locations are added as candidates to the set of auxiliary invariants generated by the analysis. New elements in this set are periodically checked for validity by k-induction. Thereby, valid externally generated invariants are conjoined with the predicates stored in the analysis' abstract states corresponding to the invariant's location. Invalid invariants are thus ignored.
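To illustrate why an auxiliary invariant can help the induction step, consider the following toy sketch of k-induction on an explicit, deterministic finite-state system. The system (a counter incremented by 2 modulo 8), the property and the function name are our own assumptions and not taken from the evaluation; the auxiliary invariant is assumed to have been validated already.

```python
def holds_by_k_induction(states, init, step, prop, k, aux=lambda s: True):
    """Try to prove that prop holds on all reachable states by k-induction."""
    # Base case (BMC): prop holds on the first k states of the run from init.
    state = init
    for _ in range(k):
        if not prop(state):
            return False
        state = step(state)
    # Induction step: k consecutive states satisfying prop (and the
    # auxiliary invariant) must be followed by a state satisfying prop.
    for start in states:
        path = [start]
        for _ in range(k):
            path.append(step(path[-1]))
        if all(prop(s) and aux(s) for s in path[:-1]) and not prop(path[-1]):
            return False
    return True


states = range(8)
step = lambda x: (x + 2) % 8      # reachable from 0: 0, 2, 4, 6
safe = lambda x: x != 5           # safety property
even = lambda x: x % 2 == 0       # auxiliary invariant, e.g. from a helper

print(holds_by_k_induction(states, 0, step, safe, k=1))            # fails
print(holds_by_k_induction(states, 0, step, safe, k=4))            # succeeds
print(holds_by_k_induction(states, 0, step, safe, k=1, aux=even))  # succeeds
```

Without the auxiliary invariant, the induction step fails for k = 1 because the unreachable state 3 satisfies the property but leads to 5; the proof only succeeds for k = 4. With the invariant, k = 1 already suffices.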

#### **3.4 Adapter for LLVM-based Helper Invariant Generators**

Next, we exemplify an adapter for helper invariant generators working on LLVM, following the general construction depicted in Figure 3. Often, tools associate invariants with LLVM basic blocks. A basic block is a code fragment having a single entry location (the first) and a single exit location (in general the last location of the block). To construct a witness containing the invariants, we need to translate them and find the matching C-code location for the basic block. For both, we use the LLVM-IR equipped with debug information, obtained by calling the compiler with launch parameter -g. Thereby, we obtain the IR-code fragment of the program in Figure 2(a), shown in simplified form and containing the most important debug information as comments. The example contains two basic blocks, entry and bb.

<sup>5</sup> This has been reported by tool developers and has also been shown in our experiments.


The helper invariant generator computes the invariant v1 − v4 − v3 = 0 for the example and associates it with the basic block bb. At first, we need to transform the variables from the IR to the C-variables occurring in the program. In this example we can use the debug information, as shown in the comments in the code. In general, a more sophisticated procedure is needed since LLVM-IR uses three-address code. Therein, complex expressions are split into several statements using intermediate variables, which have to be resolved to C-expressions.
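For the simple case in which every IR variable corresponds directly to a C-variable, the translation amounts to a renaming. The following sketch assumes a mapping extracted from the debug information; the concrete mapping (v1 to n, v3 to y, v4 to x) and the function name are hypothetical and chosen so that the result matches the invariant n − x − y = 0 from Example 1.

```python
import re

# Hypothetical mapping from IR variables to C-variables (from debug info).
DEBUG_INFO = {"v1": "n", "v3": "y", "v4": "x"}


def to_c_invariant(ir_invariant, debug_info):
    """Rename IR variables in an invariant to the C-variables they represent."""
    def rename(match):
        name = match.group(0)
        if name not in debug_info:
            # Intermediate variables would need to be resolved to C-expressions.
            raise KeyError(f"no debug information for IR variable {name}")
        return debug_info[name]

    return re.sub(r"\bv\d+\b", rename, ir_invariant)


print(to_c_invariant("v1 - v4 - v3 == 0", DEBUG_INFO))
```

Intermediate variables introduced by three-address code are deliberately rejected here, since resolving them requires the more sophisticated procedure mentioned above.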

Afterwards, the transformed invariant needs to be associated with the correct location in the C-code. We analyze the LLVM IR program structure to map the basic blocks back to C-locations. In the example, the block bb is identified as being the loop of the program, thus the invariant is mapped to the loop head. For this, we employed some basic functions provided by PHASAR [41] in our adapter. Finally, we construct the CFA of the C-program, store the invariants at the nodes and convert the equipped CFA to a verification witness.

# **4 Evaluation**

In the following, we evaluate different instantiations of CoVEGI. We focus on both effectiveness and efficiency, generally aiming at checking whether the use of CoVEGI can increase the number of correctly solved verification tasks within the same resource limits. A more detailed evaluation of CoVEGI can be found in an extended pre-print [29].

#### **4.1 Research Questions**

In the evaluation, we were interested in the following three research questions.


Table 2: Summary of tools used as helpers


#### **4.2 Experimental Setup**

**Tools.** To evaluate the performance of our framework CoVEGI, we instantiated it with predicate abstraction and k-induction as master verifiers and three helpers, using existing off-the-shelf invariant generation tools. We based the implementation of our CoVEGI algorithm on CPAchecker<sup>6</sup> 1.9.1. To the best of our knowledge, there are no standalone and publicly available invariant generators that generate invariants for both global and local variables without doing a full verification. To be able to evaluate CoVEGI, we decided to use off-the-shelf verifiers as invariant generators instead, using only the generated invariants. We thus looked at current and past participants of the annual competition on software verification SV-COMP [2] for invariant generation. We chose the tools SeaHorn [28], UltimateAutomizer [30] and VeriAbs [1]. Both UltimateAutomizer and VeriAbs achieved excellent results in this year's SV-COMP, which is why we chose them. As third tool we use SeaHorn, a verification tool that neither currently participates in SV-COMP nor produces witnesses. It operates on the LLVM intermediate representation; therefore, we used the adapter exemplified in Section 3.4. The three helper invariant generators are used as black boxes and employ verification techniques complementary to those of both the other helpers and the two masters. An overview of the techniques employed in these tools is given in Table 2. The table also states whether the helpers require mappers and adapters. For VeriAbs and UltimateAutomizer we used the versions used in SV-COMP 2020<sup>7</sup>. Since there is no precompiled binary of SeaHorn, we employ the docker container of the latest version<sup>8</sup>. All three helper invariant generators are used in their default configuration.

<sup>6</sup> https://github.com/sosy-lab/cpachecker, Revision (8646a85)

<sup>7</sup> https://gitlab.com/sosy-lab/sv-comp/archives-2020/tree/master/2020

Table 3: Comparison of the two master verifiers running standalone and using a single helper.

During evaluation, we used the following default configurations for our own framework: We set termAfterFirstInv and restartMaster to true, setting the timerM to 50s<sup>9</sup> and the timeoutH to 300s. In general, we will use the abbreviations SH for SeaHorn, UA for UltimateAutomizer and VA for VeriAbs.

**Verification Tasks.** The verification tasks used are taken from the set of SV-COMP 2020 benchmarks<sup>10</sup>. As we are interested in finding suitable loop invariants, we selected all tasks from the category ReachSafety-Loops. To obtain a broader distribution of tasks, we randomly selected 55 additional tasks from the categories ProductLines, Recursive, Sequentialized, ECA, Floats and Heap, yielding 342 tasks in total.

**Computing Resources.** We conducted the evaluation on three virtual machines, each having an Intel Xeon E5-2695 v4 CPU with eight cores and a frequency of 2.10 GHz and 16GB memory, running Ubuntu 18.04 LTS with Linux kernel 4.15. We ran our experiments using the same setting as in SV-COMP, giving each task 15 minutes of CPU time on 8 cores and 15GB of memory. We employed BenchExec, which guarantees these resource limitations [12].

**Availability.** Our tool and all experimental data are available<sup>11</sup>.

#### **4.3 Experimental Results**

We implemented the CoVEGI framework as a proof of concept in the CPAchecker framework. For this, we had to extend the existing implementations of k-induction and predicate abstraction with witness injection. For the helper invariant generators we did not change a single line of code, only adding adapters if needed. Integrating helpers like VeriAbs, which require neither an adapter nor a mapper, can be done within a few lines of code. Although the implementation is a proof of concept, this shows that the presented framework works in practice and is applicable to all kinds of off-the-shelf helper invariant generators, those producing verification witnesses as well as those generating invariants in IR.

<sup>8</sup> suggested by the developers; used docker seahorn/seahorn-llvm5 (4c01c1d)

<sup>9</sup> Which has turned out to be a preferable value, as we explain in [29]

<sup>10</sup> https://github.com/sosy-lab/sv-benchmarks/releases/tag/svcomp20

<sup>11</sup> https://covercig.github.io/covegi/

**RQ1 (Effectiveness).** To evaluate whether a master verifier benefits from the support of a helper, we execute a combination of a master and a helper in the default configuration and compare it to the master running standalone. Here, we are interested in the number of *correct* verification results, i.e., the verifier correctly reporting the safety property to be fulfilled (result *true*) or not (result *false*). Running standalone, k-induction can correctly solve 146 of the verification tasks, predicate abstraction 116.

Table 3 gives the results of this experiment. In the table we see the overall number of correct results, the number of correct *true* and correct *false* results, plus the number of tasks additionally solved when using a helper and the number uniquely solved by the configuration. Through the cooperative invariant generation, the performance of both masters is increased. As expected, this applies only to verification tasks with a fulfilled safety property, i.e., the invariant generators can help in proving a property to hold, but cannot help in refuting properties (as they correctly do not generate invariants in these cases). Besides the additionally solved tasks, there are also one (for SH and UA) and two (for VA) tasks, respectively, which cannot be correctly solved anymore. In these cases, the master consumes most of the CPU time available, hence sharing resources in cooperation with the helpers results in a timeout.

On our data set, the total number of correctly solved tasks using CoVEGI increases by 12% for k-induction and 14% for predicate abstraction as master.

**RQ2 (Efficiency).** Next, we evaluate the efficiency of CoVEGI, analyzing the CPU time spent solving the verification tasks. As CoVEGI eventually shares the CPU time between master and helpers, we expect that more time is needed to compute a correct result after the helper is started.

Figure 4 shows two quantile plots of the verification runs, 4(a) with k-induction and 4(b) with predicate abstraction as master. A datapoint (x, y) in the plot means that the verifier computes the x-fastest correct results in at most y seconds. As CoVEGI instances behave like masters standalone in the first 50 seconds, we only show results *not* solved within these 50 seconds. We see that for tasks requiring a low amount of time, all instances (including the master alone) require a similar amount of CPU time. For tasks requiring more time, CoVEGI is actually often faster, the extreme being predicate abstraction as master, which alone is unable to solve more difficult tasks in the given time.
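How the datapoints of such a quantile plot are obtained can be sketched as follows. The CPU times below are made-up illustration values, not measurements from our experiments, and we assume that ranks are computed over all correct results before the results solved within the first 50 seconds are hidden.

```python
def quantile_points(cpu_times, cutoff=50.0):
    """Return (x, y) points: the x fastest correct results need at most y seconds.

    cpu_times: CPU seconds of the correctly solved tasks of one verifier.
    Points with y <= cutoff are omitted, as in the plots described above.
    """
    ranked = enumerate(sorted(cpu_times), start=1)
    return [(rank, seconds) for rank, seconds in ranked if seconds > cutoff]


# Hypothetical CPU times of five correctly solved tasks.
times = [12.0, 48.5, 61.2, 130.0, 75.4]
print(quantile_points(times))
```

A curve that reaches further to the right thus solves more tasks correctly, and a lower curve needs less CPU time for the same number of tasks.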

As an example, we also compared the CPU time of k-induction standalone with CoVEGI using VeriAbs as helper *per task*. It turns out that sharing only slightly impacts the runtime, as shown in Figure 5. The scatter plot compares the CPU time of k-induction standalone as master and k-induction supported by VeriAbs, in case both tools solved the task correctly. A datapoint (x, y) means that k-induction standalone takes x seconds to solve the task and in combination with VeriAbs y seconds. The red dashed box contains all tasks solved within 50 seconds, where both tools behave equally, since the master does not request help in these cases. We see some tasks for which helping increased the runtime, but also some for which it decreased it. In most of the cases, the CPU time used by CoVEGI is not significantly higher.

Fig. 4: Quantile plots for CoVEGI using different single helpers: (a) k-induction as master, (b) predicate abstraction as master.

Finally, we compare the average CPU time needed to correctly solve a task. Table 4 shows the average time needed for all tasks and – in brackets – for the correctly solved tasks only. We observe that the runtime increases when only looking at correctly solved tasks (in particular for VeriAbs), however, when considering all tasks the CPU time is even decreased. The latter effect is due to the number of timeouts of the master decreasing when cooperating with helpers. Concluding, we can make the following observation.

On our dataset, collaborative invariant generation does not negatively impact the efficiency; in some cases we even see small improvements.

**RQ3 (Combination of helpers).** In RQ3, we were interested in finding out (a) whether it is beneficial to run two invariant generators in parallel, and (b) if yes, which pair is best for this. We thus studied the number of correctly solved tasks using the three possible pairs of helpers, each running two helpers in parallel. Table 5 shows the results.

Table 4: Total CPU time for all tasks and average CPU time taken for a correct answer in brackets, both in seconds.


Fig. 5: Scatter plot for kInd and kInd-VA

Table 5: Number of correctly solved tasks using different forms of cooperation with two or three helpers running in parallel.


For checking whether parallel execution of helpers is beneficial, these numbers need to be compared against those for a single helper as given in Table 3. We see that predicate abstraction benefits from using two helpers, especially using UltimateAutomizer and VeriAbs. Using CoVEGI with these tools perfectly combines their strengths, thereby increasing the number of correctly solved tasks in total by 17%. In contrast, it turns out that for k-induction none of the combinations of two helpers outperforms CoVEGI using VeriAbs only. For UltimateAutomizer and VeriAbs as helpers, the total number does not change, only the set of solved tasks. For instance, nearly 50% of the additional tasks solved by kInd-UA-VA are not solved using kInd-UA and vice versa. This result is based on the fact that they have to share the available CPU time in the combination. Hence, tasks that are solved using one of them as helper alone could not be solved anymore in a combination because of timeouts. This phenomenon is even more an issue when running all three helpers in parallel.

The combination of all three helpers solves only 154 tasks correctly for k-induction and 129 for predicate abstraction. In addition, we evaluated different values for the parameter timeoutH in [29], where it turns out that waiting for all helpers to finish does not increase the number of correctly solved tasks.

On our dataset, CoVEGI can increase the total number of correctly solved tasks using UA and VA in parallel; in general waiting for the other tool to also finish its computation does not pay off.

#### **4.4 Threats to Validity**

We have conducted our evaluation using a random sample of tasks as well as those in the category Loops. Although this guarantees some diversity, our findings may not completely carry over to arbitrary real-world programs.

The experiments are conducted using the reliable framework BenchExec on identical machines with the same resource limitations, guaranteeing comparable results. As SeaHorn is used within a docker container, however, its CPU usage cannot be measured by BenchExec. We therefore measured it externally, rounded it up and added it to the measured CPU time, obtaining a lower bound for the correctly solved tasks. Thereby, all results stay valid, especially those of the best performing instantiations of CoVEGI, as they do not use SeaHorn.

Our implementation of CoVEGI relies on the correctness of the used master verifiers and helpers (which are given) as well as on the adapters (which we build). An incorrectly translated invariant may however influence the performance only negatively. Both master verifiers used as well as UltimateAutomizer and VeriAbs are participating in the annual SV-COMP, hence they might be tuned to the tasks employed. This does however not influence the validity of the results since our interest is in the *additional* number of tasks solved by cooperation, not the solved ones per se.

# **5 Related Work**

In this paper, we presented a framework for cooperative verification via collective invariant generation. The idea of collaboration for verification by combining known techniques has been widely employed before. For instance, there are combinations of verification with testing approaches [21,22,26,18,19,24] and with approaches for invariant generation [40,27,39,15,17]. The latter combinations are conducted in a *white box* manner using strong coupling between the components, making the addition of a new approach a challenging task. Our framework conceptually decouples the invariant generation from the verification, making it more flexible. In addition, using a black box integration with defined exchange formats allows us to easily exchange or integrate new approaches.

There are also existing concepts for collaboration between different techniques in a *black-box* manner. Conditional model checking is a technique for sequentially composing different model checkers, sharing information between the tools in form of conditions [7]. Beyer and Jakobs developed a concept for combining model checking with testing [8]. Although both approaches enable cooperation, none combines a verification tool and tools for invariant generation.

We next briefly discuss three approaches which are conceptually closer to our framework. Frama-C is a framework for code analysis, aiming at analyzing industrial-size code [35]. The framework contains different plugins, each implementing a verification or testing technique. The plugins can exchange information in the form of ACSL source code annotations. Within Frama-C, the analyzers can collaborate by being composed either sequentially or in parallel. For this, partial results produced by an analysis can be completed by a second one, or several partial results computed in parallel are composed to a complete result. Frama-C offers the general possibility to define cooperation between existing plugins. To the best of our knowledge, however, Frama-C does not provide a conceptual collaboration of a verification approach and tools for invariant generation driven by the verification approach's demand for support.

The approach of using continuously refined invariants for k-induction [5] uses a lightweight dataflow analysis which can be considered to be a helper for verification. Therein, the supporting invariant generator runs in parallel to the k-induction analysis. Compared to our framework, the main difference is the form of cooperation used. Beyer et al. use a white-box integration for the cooperation between k-induction and the invariant generator, building hard-wired connections between both analyses and sharing the information *inside* the tool. Thus, integrating external tools is hard to achieve. Moreover, the approach is designed to work for k-induction only. Note that an analogous approach is proposed by Brain et al. [17].

Pauck and Wehrheim proposed CoDiDroid, a framework for cooperative taint flow analysis for Android apps [38]. Within their framework, different analysis tools with specialized capabilities are combined as black boxes. CoDiDroid is however tailored to the needs of Android taint flow analysis, so the exchanged information differs. Hence, CoDiDroid is not able to orchestrate or exchange information on safety analysis with shared invariant generation.

To summarize, there are a lot of existing approaches for cooperative verification, but most of them are white-box combinations, and the existing black-box combinations are not general enough to allow for collective invariant generation.

# **6 Conclusion**

In this paper, we have presented a novel form of black box cooperation for software verification via externally generated invariants. Within the configurable framework named CoVEGI, the so called master verifier steering the verification process is able to delegate the task of invariant generation to one or several helper invariant generators.

We implemented CoVEGI within the CPAchecker framework using k-induction and predicate abstraction as master analyses, supported by the three existing helpers SeaHorn, UltimateAutomizer and VeriAbs. Our evaluation on a set of SV-COMP verification tasks shows that CoVEGI increases the number of correctly solved tasks without increasing the overall verification time. The best combination of helpers, UltimateAutomizer and VeriAbs in parallel, yields an increase of 12% for k-induction and 17% for predicate abstraction.

Next, we plan to enhance the cooperation by analyzing the behavior of the master in order to identify an optimal point to request for help. Moreover, extending CoVEGI by additionally taking error traces found by the helper into account is also scheduled. In addition, we intend to investigate whether a selection of helpers on the basis of the given verification task is beneficial.

# **References**



# **Engineering Secure Self-Adaptive Systems with Bayesian Games**

Nianyu Li<sup>1</sup>, Mingyue Zhang<sup>2</sup>, Eunsuk Kang<sup>3</sup>, and David Garlan<sup>4</sup>

<sup>1</sup> Peking University, Beijing, China nianyu_li@pku.edu.cn

<sup>2</sup> Peking University, Beijing, China mingyuezhang@pku.edu.cn

<sup>3</sup> Carnegie Mellon University, Pittsburgh, USA eunsukk@andrew.cmu.edu

<sup>4</sup> Carnegie Mellon University, Pittsburgh, USA garlan@cs.cmu.edu

**Abstract.** Security attacks present unique challenges to self-adaptive system design due to the adversarial nature of the environment. Game theory approaches have been explored in security to model malicious behaviors and design reliable defense for the system in a mathematically grounded manner. However, modeling the system as a single player, as done in prior works, is insufficient for the system under partial compromise and for the design of fine-grained defensive strategies where the rest of the system with autonomy can cooperate to mitigate the impact of attacks. To deal with such issues, we propose a new self-adaptive framework incorporating Bayesian game theory and model the defender (i.e., the system) at the granularity of *components*. Under security attacks, the architecture model of the system is translated into a *Bayesian multi-player game*, where each component is explicitly modeled as an independent player while security attacks are encoded as variant types for the components. The optimal defensive strategy for the system is dynamically computed by solving the pure equilibrium (i.e., adaptation response) to achieve the best possible system utility, improving the resiliency of the system against security attacks. We illustrate our approach using an example involving load balancing and a case study on inter-domain routing.

# **1 Introduction**

A self-adaptive system is designed to be capable of modifying its structure and behavior at run time in response to changes in its environment and the system itself (e.g., variability in system performance, deployment cost, internal faults, and system availability) [9,12]. One of the major challenges in self-adaptive systems is managing *uncertainty*; i.e., the system should be capable of making appropriate planning decisions despite limited observations about its environment. Achieving *security* in presence of uncertainty is particularly challenging due to the adversarial nature of the environment [17,13]: (1) to avoid detection, a typical attacker may attempt to remain hidden while carrying out its actions, and so accurately estimating its objectives and capabilities can be difficult, and (2) the attacker actively attempts to cause as much harm as possible to the system, and so a typical "average case" analysis may not be appropriate for making optimal defensive decisions [28].

© The Author(s) 2021

E. Guerra and M. Stoelinga (Eds.): FASE 2021, LNCS 12649, pp. 130–151, 2021. https://doi.org/10.1007/978-3-030-71500-7_7

Various game-theoretic approaches have been explored in the security community for modeling interactions between the system and attackers as a *game* between a group of *players* (i.e., system and multiple attackers, each as one player) and computing optimal strategies (i.e., Nash Equilibrium) for the system to minimize the impact of possible attacks and improve its resiliency against them [40,15,19,28]. These methods can be used to (1) model adversarial behaviors by malicious attackers [19], and (2) design reliable defense for the system by using underlying incentive mechanisms to balance perceived risks in a mathematically grounded manner [15]. In particular, a type of game-theoretic method called *Bayesian games* [25] is designed to explicitly encode and reason about uncertainty in the information that players have (e.g., partial knowledge about each other's actions and objectives).

Prior works in security that leverage game theory [40,15,19,28] have treated the system as an independent player (i.e., defender) in the game. However, such a monolithic approach that abstracts the entire system as a single player might be insufficient for capturing certain practical scenarios, where only one part of the system is compromised while the remaining system components may cooperate with each other to mitigate the impact of an ongoing attack.

In this paper, we argue that compared to a coarse one-player abstraction of a system, modeling the defender under security attacks at the granularity of *components* is more expressive, in that it allows the design of fine-grained defensive strategies for the system under partial compromise. In particular, we advocate a security modeling approach where an attack is modeled as the anomalous behavior of a system component that deviates from its expected behavior, as an alternative to a conventional approach where attackers themselves are modeled as separate players.

To this end, we propose a novel approach to improving the resiliency of self-adaptive systems against security attacks by leveraging game theory. In particular, we propose a new self-adaptive framework that leverages *multi-player Bayesian games* at the granularity of *components* at the system architecture level. Specifically, in our approach, each major system component is modeled separately as an independent player. Under an attack, one or more components with vulnerabilities might be exploited by an attacker to deliberately perform harmful actions (i.e., turning into a malicious type). Different types of attacks that these components might be subject to are encoded as different *types* of game players, encoding uncertainty in the attack being carried out. The rest of the components are then modeled as forming a coalition to mitigate the impact of the malicious actions by those compromised components.

To perform a security analysis, a model of the system architecture and component attacks is translated into a mathematical Bayesian game structure. Then, the adaptive defensive strategy for the system is dynamically computed by solving for a pure equilibrium, to achieve the best possible system utility under all assignments of the components to their possible types (i.e., in the presence of security attacks).

Our main contributions are summarized as follows:


# **2 Background**

#### **2.1 Running Example**

As a running example, we adopt Znn.com, a hypothetical news website that has been used as a representative system for the application of self-adaptive systems [10,11]. In a typical workflow, given a request from a client, the web server fetches appropriate content (in the form of text) from its back-end database and generates a web page containing a visualization of the text. Furthermore, the system also provides an optional service with multimedia content (e.g., images, videos). This service involves additional computation on the server side, but also brings in more revenue compared to the requests with only text. With $R_M$ and $R_T$ being the revenue and $C_M$ and $C_T$ being the computation required for one response to a user request with media content and with only text content, respectively, we assume that $R_M > R_T > 0$ and $C_M > C_T > 0$.

In order to support multiple servers, a *LoadBalancer* is added to distribute the requests from the users to a pool of servers, as shown in Figure 1. Each server incurs a cost that depends on its load, e.g., due to potentially high response times: companies such as Amazon, eBay, and Google claim that increased user-perceived response time results in revenue loss [33]. More specifically, the cost per server is $(S_i - T)^2 / K$ (and zero as long as $S_i \le T$, cf. Equation (1)), where $S_i$ is the current load of server $i$, which depends on the request serving mode (i.e., $S_i = D_i C_T$ in text-only mode and $S_i = D_i C_M$ in multimedia mode, where $D_i$ is the number of requests distributed to server $i$); $T$ is the threshold beyond which the response time would be affected; and $K$ is a constant used to adjust the cost ratio.

The goal of the self-adaptive system is to maximize the difference between revenue and cost.

$$U = R_M x_M + R_T x_T - \sum_{i=1}^{3} \begin{cases} 0 & \text{if } S_i \le T \\ (S_i - T)^2 / K & \text{otherwise} \end{cases} \tag{1}$$

where $x_M$ and $x_T$ are the numbers of responses with media and text content, respectively; the penalty is the sum of the cost for all three servers.
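Equation (1) can be evaluated directly, as the following sketch shows. The function names and all concrete numbers (revenues, loads, threshold and the constant K) are hypothetical values chosen for illustration only.

```python
def server_cost(load, threshold, k):
    """Cost of one server: zero up to the threshold, quadratic beyond it."""
    return 0.0 if load <= threshold else (load - threshold) ** 2 / k


def utility(r_m, r_t, x_m, x_t, loads, threshold, k):
    """System utility U from Equation (1): revenue minus the server costs."""
    revenue = r_m * x_m + r_t * x_t
    return revenue - sum(server_cost(s, threshold, k) for s in loads)


# Hypothetical values: 10 media and 20 text responses on three servers.
u = utility(r_m=3.0, r_t=1.0, x_m=10, x_t=20,
            loads=[30.0, 50.0, 70.0], threshold=40.0, k=10.0)
print(u)  # revenue 50 minus costs 0 + 10 + 90
```

In this made-up setting the overloaded third server dominates the penalty, so the utility becomes negative although revenue is earned.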

Suppose that some of the servers are vulnerable to various attacks such as password guessing, SQL injection, command injection, etc. [1]. The information collected from a web server, however, cannot conclusively show whether it is compromised, e.g., due to deficiencies of scanning tools; a compromise can only be asserted with some uncertainty. As shown in Figure 1, *Server2* could potentially be attacked with a probability of 20%, while *Server3* has a higher probability of 50%. These two servers, if actually compromised, might perform harmful actions controlled by the attackers to achieve their objectives, causing a loss of system reward. Here we assume the malicious strategy of simply discarding all the distributed user requests. The reward of an attack is given by the system loss, i.e., the difference between the maximum reward the system could achieve and the reward it achieves under attack, leading to a zero-sum game.
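The uncertainty about compromised servers induces a probability distribution over joint type assignments. The following sketch enumerates it for the running example; it assumes that the servers are compromised independently of each other and that *Server1* is never attacked, neither of which is stated explicitly above. The type names `normal` and `malicious` are illustrative.

```python
from itertools import product

# Attack probabilities from the running example; Server1 = 0.0 is an assumption.
attack_prob = {"Server1": 0.0, "Server2": 0.2, "Server3": 0.5}


def joint_type_distribution(attack_prob):
    """Probability of each joint type assignment, assuming independence."""
    servers = list(attack_prob)
    distribution = {}
    for types in product(("normal", "malicious"), repeat=len(servers)):
        p = 1.0
        for server, server_type in zip(servers, types):
            q = attack_prob[server]
            p *= q if server_type == "malicious" else 1.0 - q
        if p > 0:
            distribution[types] = p
    return distribution


for types, p in joint_type_distribution(attack_prob).items():
    print(types, round(p, 2))
```

Under these assumptions four joint assignments have non-zero probability, e.g., the case that no server is compromised occurs with probability 0.8 · 0.5 = 0.4.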

#### **2.2 Bayesian Game Theory**

*Game theory* is the mathematical analysis of individual and cooperative behaviors of players who follow certain strategies to satisfy their self-interests [21,38]. A *Bayesian* game is a type of game in which players have incomplete information about the other players [25]. For example, a player may not know the exact type (e.g., malicious or good) of another player, where each type is associated with a unique payoff function, but instead has beliefs about these types. These beliefs are represented by a probability distribution over the possible types. More formally, Bayesian games or *incomplete information games* are defined as follows:

**Definition 1.** *A Bayesian game is a tuple* BG = ⟨P, A, Θ, U, ρ⟩


Importantly, throughout the Bayesian games, we assume that the assignment of types to players is private information, while the prior type probability distribution, the action spaces, and the payoff functions are common knowledge. A player's strategy can be pure (i.e., take a deterministic action) or mixed (i.e., randomly choose an action according to some probability distribution). A strategy for player i is s<sub>i</sub> : Θ<sub>i</sub> × A<sub>i</sub> → [0, 1] such that ∀θ ∈ Θ<sub>i</sub>, Σ<sub>a∈A<sub>i</sub></sub> s<sub>i</sub>(a|θ) = 1. The strategy is pure if ∀θ ∈ Θ<sub>i</sub>, ∃a ∈ A<sub>i</sub>, s<sub>i</sub>(a|θ) = 1, in which case it is also denoted as s<sub>i</sub> : Θ<sub>i</sub> → A<sub>i</sub>.

**Definition 2.** *(Bayesian Nash Equilibrium Strategy) Given a joint strategy for all players* s<sup>∗</sup> = [s<sup>∗</sup><sub>1</sub>, ..., s<sup>∗</sup><sub>n</sub>]*,* s<sup>∗</sup> *is the Bayesian Nash equilibrium strategy if for any player* i*, it satisfies:*

$$s\_i^\* = \arg\max\_{s\_i \in S(\theta\_i)} \sum\_{\vec{\theta}\_{-i}} \rho(\vec{\theta}\_{-i}|\theta\_i) \mathbb{E}\_{\vec{a}\_{-i} \sim \vec{s}\_{-i}^\*, a\_i \sim s\_i} [u\_i(a\_i, \vec{a}\_{-i}; \theta\_i, \vec{\theta}\_{-i})]$$

*where* a<sub>−i</sub> = [a<sub>1</sub>, ..., a<sub>i−1</sub>, a<sub>i+1</sub>, ..., a<sub>n</sub>]*,* θ<sub>−i</sub> = [θ<sub>1</sub>, ..., θ<sub>i−1</sub>, θ<sub>i+1</sub>, ..., θ<sub>n</sub>]*,* s<sup>∗</sup><sub>−i</sub> = [s<sup>∗</sup><sub>1</sub>, ..., s<sup>∗</sup><sub>i−1</sub>, s<sup>∗</sup><sub>i+1</sub>, ..., s<sup>∗</sup><sub>n</sub>]*,* S(θ<sub>i</sub>) *is the set of all possible strategies for agent* i *under* θ<sub>i</sub>*, and* ρ(θ<sub>−i</sub>|θ<sub>i</sub>) *is the conditional probability representing player* i*'s belief about the other players' types under type* θ<sub>i</sub>*.*

A Bayesian Nash equilibrium is a set of strategies, one for each type of each player, in which every strategy is a best response that maximizes the player's expected payoff given the other players' equilibrium strategies. In a Nash equilibrium, no player can improve their payoff by unilaterally changing strategy while the strategies of the others remain fixed [25,21].
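The expectation inside Definition 2 can be illustrated for a single player choosing among pure actions. The sketch below is only a toy: the function names, the two-type opponent, and the payoff numbers are hypothetical, and it computes one player's best response to a fixed opponent strategy rather than a full equilibrium, which additionally requires every player's strategy to be a best response.

```python
def expected_payoff(action, belief, opponent_strategy, payoff):
    """Expected payoff of a pure action against a type-dependent opponent.

    belief[theta]               -> probability that the opponent has type theta
    opponent_strategy[theta][a] -> probability that type theta plays action a
    payoff(action, a, theta)    -> payoff of the deciding player
    """
    return sum(
        p_type * p_act * payoff(action, opp_action, theta)
        for theta, p_type in belief.items()
        for opp_action, p_act in opponent_strategy[theta].items()
    )


def best_response(actions, belief, opponent_strategy, payoff):
    """Pure action that maximizes the expected payoff under the belief."""
    return max(
        actions,
        key=lambda a: expected_payoff(a, belief, opponent_strategy, payoff),
    )


# Hypothetical toy game: a sender decides whether to route via a server that
# is either benign (forwards the request) or malicious (discards it).
def toy_payoff(action, opp_action, theta):
    if action == "hold":
        return 0.0
    return 1.0 if opp_action == "forward" else -1.0


strategy = {"benign": {"forward": 1.0}, "malicious": {"discard": 1.0}}
low_risk = best_response(["send", "hold"],
                         {"benign": 0.8, "malicious": 0.2},
                         strategy, toy_payoff)
high_risk = best_response(["send", "hold"],
                          {"benign": 0.3, "malicious": 0.7},
                          strategy, toy_payoff)
```

With these hypothetical payoffs, sending is preferred while the belief in a malicious server stays below one half, and holding is preferred above it.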

# **3 Self-Adaptive Framework Incorporating Bayesian Game Theory**

Security attacks are usually associated with a high degree of uncertainty: the defender may know little about the identity of the attackers and may not fully understand their technical effect on the system. A Bayesian game, in which players have incomplete information about the other players, is therefore appropriate for modeling and dealing with attacks under uncertainty. In this section, we propose a new type of self-adaptive framework incorporating Bayesian games. Adaptation behaviors build on the Nash equilibrium under unexpected attacks and are achieved by elaborating the widely adopted MAPE-K (Monitoring, Analysis, Planning, Execution, Knowledge) loop [27,43], shown in Figure 2.

Fig. 2: Self-Adaptive Framework.

**Knowledge.** The Knowledge Base requires the system developers or domain experts to specify (1) the component and connector model of the managed subsystem and the action space of each component, (2) the system objectives, usually defined as quality attributes quantified by the utility, and (3) the component vulnerabilities, with the potential behavior deviations that attacks can exploit. Other necessary information, such as the history of system behaviors and environment information, is saved in the Knowledge Base and can be updated for the sake of self-adaptation.

**Monitor.** The Monitor receives events generated in the managed subsystem or environment that indicate the execution of system actions or natural changes in environmental factors. It gathers and synthesizes information about on-going attacks through sensors and saves it in the Knowledge Base. In our example, events such as a heavy loss of user requests or a command injection can indicate a potential attack on the web server.

**Analyzer.** During speculative analysis, the Analyzer identifies conditions of the environment or managed subsystem, arising from the input of the Monitor, that represent violations or better satisfaction of goals. It further checks whether certain components are attacked and with what probability, identifies potential deviating malicious actions, and estimates the rewards of the attack, based on the knowledge about component vulnerabilities and system objectives. Such attack probabilities can be analyzed with a statistical combination of all feasible scenarios along with expert judgment [16,24]. A typical example is that both Server2 and Server3 are analyzed to be compromised with a certain probability, discarding user requests and thereby reducing the system utility.

**Planner.** The Planner generates a workflow of adaptation actions aiming to counteract violations of system goals or to better achieve them. The workflow consists of one or more actions obtained by automatically solving the multi-player Bayesian game constructed from the potential attacks reported by the Analyzer, the architectural model of the managed subsystem, and the system objectives, as elaborated in Section 4. For each security situation, the Planner generates an equilibrium, if one exists, as the adaptation in response to unexpected attacks, or prompts for a change in the system design if the violation cannot be handled. For the Znn.com website under security attacks, feasible actions include distributing a larger percentage of user requests to the normal server, decreasing the percentage sent to servers with a high probability of compromise, and adjusting the fidelity level of the servers.

**Executor.** During execution, the strategies from the adaptation equilibrium are enacted on the managed subsystem through actuators. A typical example is setting, in the *LoadBalancer*, the percentage of user requests distributed to each server.

In the next part, we focus on planning activity with Bayesian game theory. We assume adequate monitoring in place, sufficient analysis methods on potential attacks with uncertainties based on observation and historical information, as well as an execution environment through which selected adaptation strategies are enacted.

# **4 Bayesian Game Through Model Transformation**

In this section, we start by defining the system under attack and transforming the system architecture and on-going attacks into a component-based multi-player Bayesian game. Solving the game for an equilibrium yields the adaptation strategy. Then, we present the analysis results on our running example.

**Component-based System.** A system component is an independent and replaceable part of a system (e.g., a process, program) that fulfills a clear function in the context of a well-defined architecture. Typical examples are the *LoadBalancer* and servers in Figure 1. Components forming architectural structures affect different quality attributes. For example, quality attributes of user satisfaction (i.e., revenue) and the costs (i.e., penalty) identified in the Znn Website example are influenced by the actions of all four components and characterized as utility functions as shown in Eq.(1) mapping them to utility values.

**Definition 3.** *A system can be formally defined as a tuple* S = ⟨C, A, Q⟩*.*


Each component tries to react so as to maximize the system utility, essentially like a rational player in game theory. Naturally, a system under normal operation can be viewed as a cooperative game dealing with how coalitions interact. Each component is an independent player, and these interacting components/players form a coalition. For instance, in the running example, the *LoadBalancer* and the three servers collaborate to achieve the goals together, i.e., maximizing the system reward composed of revenue and penalty. Specifically, the *LoadBalancer* should assign more user requests to servers with low computation usage, like customers choosing the shortest waiting queue in a bank, while each server should adjust its fidelity level according to its current load. A server under high load may serve text-only content to decrease the cost, while a server with low usage can provide media content to promote revenue.

**Modeling Utility as Payoffs.** The payoffs of the players are allocated from the utility derived from the quality attributes. It is straightforward for developers to design a system-level payoff function (e.g., the revenue and penalty in Section 2.1). However, due to the different roles of the components and the complex relationships between them, it is complicated and sometimes intractable to manually design an appropriate component-level payoff function. To solve this problem, we use the *Shapley Value Method*, a solution concept for fairly distributing both gains and costs among several players working in a coalition according to their marginal contributions [37,36], to automatically decompose the system-level utility into component-level payoffs. The *Shapley Value Method* applies primarily in situations where the contributions of the players are unequal but the players cooperate with each other to obtain the payoff. Given the component set C and a system-level utility function v, the payoff for a component i is:

$$\phi\_i(C, v) = \frac{1}{|C|!} \sum\_{C' \subseteq C \backslash \{i\}} |C'|!(|C| - |C'| - 1)![v(C' \cup \{i\}) - v(C')] \tag{2}$$

where |C| is the number of components in the set; C\{i} is the set C excluding component i; and v(C′) is the expected system-level utility when the system consists only of the component set C′.

The following is a typical example of system utility allocation with the *Shapley Value Method* for the Znn website. To simplify the illustration, we consider the situation where *Server2* and *Server3* are indeed compromised, the *LoadBalancer* chooses the strategy of equally distributing user requests to *Server1* and *Server2* (i.e., the requests distributed to *Server1*, *Server2* and *Server3* are 50, 50 and 0, respectively), and *Server1* selects the text-only mode. The total number of unprocessed requests in this setting is 100, which is assumed to be the full load of a server serving only text, with R<sub>M</sub> = 1.6, R<sub>T</sub> = 1, T = 50, and K = 25 in Eq. (1). The computation capacity required by a unit of text and media is 1 and 1.4 (i.e., C<sub>T</sub> and C<sub>M</sub>), respectively. Thus, the system utility in this situation is U<sub>system</sub> = 50 (i.e., 50 × 1 − (50 × 1 − 50)<sup>2</sup>/25, with the remaining 50 requests discarded by the malicious *Server2*). The *cooperative* player set consisting of the *LoadBalancer* and *Server1* shares this utility, while *Server2* and *Server3* act on behalf of the attacks' interests and are therefore neither considered part of the coalition nor allocated any payoff from the system utility.

Based on Eq. (2), we need the following two coalitions for the Shapley value calculation: (1) If the coalition contains only the *LoadBalancer* without *Server1*, the system utility U<sub>LoadBalancer</sub> is 0, since no requests are processed by *Server1* or by the malicious *Server2*; (2) If the coalition contains only *Server1* without the *LoadBalancer* distributing user requests, the requests are passed randomly among the three servers, i.e., the requests distributed to *Server1*, *Server2* and *Server3* are 34, 33 and 33, respectively, and the system utility for this coalition U<sub>Server1</sub> is 34 (i.e., 34 × 1 − 0), because the malicious *Server2* and *Server3* do not return any response. As a result, φ<sub>LoadBalancer</sub>(C, v) = 1/2 (U<sub>system</sub> − U<sub>Server1</sub> + U<sub>LoadBalancer</sub>) = 8 and φ<sub>Server1</sub>(C, v) = 1/2 (U<sub>system</sub> − U<sub>LoadBalancer</sub> + U<sub>Server1</sub>) = 42. Therefore, the payoffs to the players *LoadBalancer* and *Server1* are 8 and 42, respectively. Meanwhile, the attacks' utility, i.e., the difference between the system utility and the highest utility the system could achieve without attacks (equally distributing user requests to the three servers with each server choosing multimedia mode, which in this setting yields 160 = 100 × 1.6 − 0), is divided equally between the two malicious players. In other words, *Server2* and *Server3* are each allocated a payoff of 55 = (160 − 50)/2. Following this allocation process, each player obtains a unique payoff under different attack situations and strategies from the *Shapley Value Method*, based on its role in contributing to the marginal system utility.
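The allocation above can be reproduced with a direct implementation of Eq. (2). This is a minimal sketch: the function name `shapley_values` and the dictionary encoding of the utility function v are illustrative choices, and the value 0 for the empty coalition is an assumption that the text does not state explicitly.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, v):
    """Shapley value of each player per Eq. (2).

    v maps a frozenset of players (a coalition) to its utility.
    """
    n = len(players)
    values = {}
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for size in range(len(others) + 1):
            for subset in combinations(others, size):
                coalition = frozenset(subset)
                # |C'|! (|C| - |C'| - 1)! weight of this coalition
                weight = factorial(size) * factorial(n - size - 1)
                total += weight * (v[coalition | {i}] - v[coalition])
        values[i] = total / factorial(n)
    return values


# Coalition utilities from the worked example; v(empty) = 0 is an assumption.
v = {
    frozenset(): 0.0,
    frozenset({"LoadBalancer"}): 0.0,
    frozenset({"Server1"}): 34.0,
    frozenset({"LoadBalancer", "Server1"}): 50.0,
}
payoffs = shapley_values(["LoadBalancer", "Server1"], v)

# Attack payoff: loss relative to the attack-free optimum of 160, split
# equally between the two malicious servers.
attack_payoff = (160.0 - v[frozenset({"LoadBalancer", "Server1"})]) / 2
```

The two Shapley values sum to the coalition utility of 50, as expected from the efficiency property of the Shapley value.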

**Component-based Attacks.** A system under security attacks is defined as a tuple *SAS* = ⟨C, A, Q, ATT⟩. Instead of modeling an attacker or several attackers with possibly complex behaviors over different parts of the system, we model the on-going attacks ATT that the system is enduring at the component level, since the vulnerabilities of the components and their potential behavior deviations are comparatively easy to observe. ATT can be obtained by synthesizing the information from the Monitor and Analyzer, as described in Section 3.

**Definition 4.** *The security attacks on the system are formally defined as a tuple* ATT = ⟨C<sub>att</sub>, A<sub>att</sub>, P<sub>att</sub>, R<sub>att</sub>⟩*.*


**Translation into a Bayesian game.** With the definition of the system at the component level and the definition of the attacks ATT, a system under security attacks is converted into a non-cooperative Bayesian game by the following steps:


Note that this definition can be easily extended to the situation where a component is simultaneously compromised by different attackers with multiple types. The game solver adopted in this work is *Gambit* [35], a collection of tools for building game models, computing game equilibria, and analyzing game results. We use it to model the Bayesian game obtained by the above steps and to automatically compute the equilibrium strategy as the adaptation response.

#### **4.1 Analysis Results for Znn.com Example**

In this subsection, we demonstrate how our approach produces adaptation decisions under security attacks for the Znn website to enhance the system utility. In particular, we construct the Bayesian game model by following the aforementioned steps and generate the equilibrium. To explore different attack scenarios, we statically analyze a discretized region of the state space, projected over two dimensions that vary the malicious probability of *Server2* and *Server3* (i.e., probability S2 and probability S3, respectively, with values in the range [0, 1]). Each state of the discrete set requires solving the game for the Nash equilibrium, which quantifies the best utility the system could obtain. The experiment takes less than one minute to generate all the results shown in Figure 3, and the solution generation time for each state is negligible. To set up the experiment, we assume there are 100 user requests (the maximum load of a server in text-only mode), with R<sub>M</sub> = 1.6, R<sub>T</sub> = 1, C<sub>M</sub> = 1.4, C<sub>T</sub> = 1, T = 50, and K = 25 in Eq. (1). Additionally, we adopt probabilistic model checking as the benchmark [11,7,32] and compare our Bayesian game theory method with it in terms of system utility.

Figure 3 (a) illustrates the percentage of user requests distributed to *Server1* from the strategy for the *LoadBalancer* in equilibrium. As expected, the percentage of *Server1* increases progressively with the increasing malicious probability of *Server2* and *Server3* as more user requests are supposed to be processed by a server under normal operation. In particular, we observed that the user percentage is around one third when both *Server2* and *Server3* are functioning normally (i.e., both probability S2 and probability S3 are 0), with *LoadBalancer* equally delivering the user requests since none of the servers is compromised. Moreover, the percentage for *Server1* reaches around 84% when the other two servers are fully compromised. In this situation, *LoadBalancer* does not deliver all user requests to *Server1* ; otherwise *Server1* may be overloaded with the increasing costs due to high response time which in turn outweigh its benefits of request processing.

Fig. 3: Results for Znn Website: (a) percentage of user requests to *Server1* ; (b) percentage of user requests to *Server2* ; (c) strategies for *Server1* ; (d) system utility with game theory approach; (e) delta utility between Bayesian game theory approach and probabilistic model checking approach.

Figure 3 (b) describes the percentage of user requests that the *LoadBalancer* delivers to *Server2* in the equilibrium. User requests to *Server2* decrease as its malicious probability increases. In particular, *Server2* receives 50 requests when probability S2 is 0 and *Server3* is fully malicious (i.e., probability S3 = 1), since the *LoadBalancer* should then distribute the user requests equally between *Server1* and *Server2*. Figure 3 (c) presents the strategy in equilibrium for *Server1*. The states in which text content is provided are indicated by red triangles, whereas the multimedia strategies for *Server1* are denoted by white rectangles. As we can see, the red triangles are in the upper right corner, where the malicious probabilities of *Server2* and *Server3* are greater than 50%, which means that they are very likely compromised. Therefore, the *LoadBalancer* distributes as many user requests as possible to *Server1*, and *Server1* chooses to provide text-only content to avoid overloading. Otherwise, *Server1* can provide multimedia content under lower load to promote user satisfaction with higher revenue.

Figure 3 (d) illustrates the maximum utility the system can achieve under various attack situations. In particular, we observe that the utility reaches around 160 when all three servers are cooperative and decreases progressively with the increasing malicious probability of *Server2* and *Server3*. This is consistent with the fact that the system utility deteriorates under security attack. To compare the system utility of the game theory approach with existing methods, we adopt probabilistic model checking [29] as the comparison standard to formally model the running example and synthesize the adaptation strategy that maximizes the expected utility by reasoning about reward-based properties [11,7,32]. Figure 3 (e) presents the delta between the two approaches (i.e., the system utility of the game theory approach minus the utility of the probabilistic model checking approach). Without security attacks, the adaptation decisions generated by the two approaches achieve the same utility. However, with the increasing malicious probability of *Server2* and *Server3*, the game theory approach performs better, providing a better response to make up for the utility loss due to security attacks. The average delta is 10.54, i.e., an improvement of about 15 percent, given the average utility of 80.39 achieved by game theory.

### **5 Evaluation – Routing Games**

To evaluate our approach and assess its applicability, we consider a case study on an interdomain routing application. We first define the game (Section 5.1) and propose a dynamic programming algorithm that solves for the equilibrium by decomposing the problem into smaller, tractable sub-games (Section 5.2). The results are presented (Section 5.3) with a sensitivity analysis, illustrating how the system can choose a robust strategy that is effective for a range of threat landscapes, and a utility analysis that quantifies the defender's utility with the Bayesian game compared to a greedy solution within the security context.

A routing system is usually composed of smaller networks called nodes, as shown in Figure 4. Since not all nodes are directly connected, packets often have to traverse several nodes, and the task of ensuring connectivity between nodes is called interdomain routing [30,31]. Each node could be owned by an economic entity (Microsoft, AT&T, etc.) and might be compromised by an attacker at any time. Therefore, it is natural to consider interdomain routing from a game-theoretic point of view. Specifically, game players are source nodes located on a network, aiming to send a package (i.e., starting at N1) to a unique destination node (i.e., N5). The interaction between players is dynamic and complex (asynchronous, sequential, and based on partial information), and the best strategy for each player, as the adaptation response, is updated as needed.

Fig. 4: Routing Scenario.

#### **5.1 Game Definition for Interdomain Routing**

The interdomain routing system is described below with the component-based definition.


Currently, N2 and N4 are analyzed to be potentially attacked, based on the historical package delivery record; if compromised, they deliberately send the package in the opposite direction, extending the delivery time. The game definition with the security attacks is summarized below.


#### **5.2 Dynamic Programming Algorithm**

In practice, a network might be complex, and each node could have hundreds of neighboring nodes. It is impractical to directly build a game tree at the component level with a large number of players (each with a massive action set) and solve such a network in a reasonable time. To deal with this complexity, we propose an algorithm inspired by dynamic programming to effectively solve the generated Bayesian game for this class of routing problems.

Algorithm 1 for the routing game takes as input a routing network N, together with a starting point s of package delivery and a destination point d. To carry out dynamic programming, the algorithm uses a set subG to store the nodes that have been processed with their best reactive strategy. subG is initialized as an empty set (line 1) and node d is added to it (line 2), since d does not need a strategy to transmit the package. The algorithm starts by iterating over all nodes at distance disValue from the destination (line 5), with disValue initialized to 1 (line 3). For example, N2, N4 and N7 qualify in the first iteration. Each node is checked for whether it is potentially attacked (i.e., uncertain(n) in line 6). Uncertain nodes (e.g., N2 and N4) might affect the strategy of their prior nodes (line 7) (e.g., N1 and N3), which are added to todoS (line 8) so that their strategy is updated later to account for the neighboring uncertainty. A typical example is that node N3 might trade off the delivery between N4 and N6: although N4 is on the shortest path from N3 to N5, it could deliberately send the package back if controlled by the attack. If a node is not in todoS (line 11), it is directly added to subG (line 12), as the best strategy for such a benign node is to pass the package to its adjacent node along the shortest path. In this routing scenario, N2, N4 and N7 are added to subG, as their equilibrium strategies under the normal type are easily determined.

After iterating over all the nodes at disValue 1, each node in todoS (line 15) is checked for whether it satisfies the condition (line 16) that all its neighboring nodes (i.e., i ∈ adj(n)) closer to the destination (i.e., dis(i, d) == dis(n, d) − 1) have been solved with their best strategies (i.e., are in subG), so that a sub-game can be built. As shown in the example, though both N1 and N3 are prior to an uncertain node, their strategy update is postponed because N6 is not in subG yet, which affects the sub-game generation for N3 and in turn delays the sub-game construction for N1.
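Algorithm 1 relies on the hop distance dis(n, d), which can be obtained, for example, by a breadth-first search from the destination. In the sketch below, the adjacency list is an assumption reconstructed from the routes named in the text (N1 N2 N5, N1 N3 N4 N5 and N1 N3 N6 N7 N5); the authoritative topology is the one in Figure 4. The names `hop_distances`, `ADJ` and `UNCERTAIN` are illustrative.

```python
from collections import deque

# Assumed topology, reconstructed from the routes mentioned in the text.
ADJ = {
    "N1": ["N2", "N3"],
    "N2": ["N1", "N5"],
    "N3": ["N1", "N4", "N6"],
    "N4": ["N3", "N5"],
    "N5": ["N2", "N4", "N7"],
    "N6": ["N3", "N7"],
    "N7": ["N6", "N5"],
}
UNCERTAIN = {"N2", "N4"}  # nodes analyzed as potentially attacked


def hop_distances(adj, dest):
    """Breadth-first search giving dis(n, dest) in hops for every node."""
    dist = {dest: 0}
    queue = deque([dest])
    while queue:
        node = queue.popleft()
        for neighbor in adj[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


dist = hop_distances(ADJ, "N5")

# Nodes visited in the first iteration of Algorithm 1 (disValue = 1).
first_layer = sorted(n for n, h in dist.items() if h == 1)

# Lines 6-8 of Algorithm 1: prior nodes of uncertain nodes go to todoS.
todo = sorted({
    prior
    for n in first_layer if n in UNCERTAIN
    for prior in ADJ[n] if dist[prior] == dist[n] + 1
})
```

Under the assumed topology, the first layer consists of N2, N4 and N7, and the prior nodes collected in `todo` are N1 and N3, matching the walkthrough above.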

An example of sub-game construction (line 17) starting from N3 is illustrated in Fig. 5, for the case in which all conditions are satisfied. The stochastic behavior of the potentially compromised nodes is modeled by introducing a nature (or chance) player, who moves according to a probability distribution (e.g., a 50%/50% split) and thereby randomly determines whether the attacks on N2 and N4 are successful. Then, N3 chooses an action that passes the package to one of its adjacent nodes, i.e., N6 or N4. Here, N3 is a normal node aware that the package was transmitted from N1, so it is not necessary to consider a rollback to N1. The game ends after N3's action, as the following branches can be pruned: 1) to N6, the remaining route sequence is N7 and N5 by default, as their best strategies have already been solved (i.e., N6 delivers the package to N7, which in turn forwards it to N5); 2) to N4, N4 forwards the package to N5 if it is normal and sends it back to N3 if it is of malicious type.

When the game terminates, each player gets a unique payoff depending on the branch followed. As shown in the leftmost rectangle, all the players (including N2 and N4, as they are benign collaborating nodes) equally share the system utility value of 6, resulting from 3 hops from N3 to N5 plus the shortest path from N1 to N3. On the rightmost branch, however, only the five players other than N2 and N4 are allocated the system utility of 4. This utility results from 6 hops: N3 decides to deliver the package to N4 while nature has chosen the malicious type for N4, which sends the package back to N3 to maximize the attack's utility. Once N3 receives the package from N4, it redelivers the package to N6, because N3 as a good player does not send it to N4 again. Accordingly, N2 and N4 are uniformly allocated, as their payoff, the delta (i.e., 4) between the utility the system obtained (i.e., 4) and the maximum utility the system could obtain (i.e., 8). The payoffs of the remaining branches can be calculated in the same way.

**Algorithm 1** Dynamic Programming Algorithm to Solve Routing Game.

```
1: subG ⇐ ∅
2: addNode(d, subG)
3: disValue ⇐ 1
4: repeat
5:   for all n ∈ N and dis(n, d) == disValue do
6:     if uncertain(n) == true then
7:       for all np ∈ adj(n) and dis(np, d) == disValue + 1 do
8:         addNode(np, todoS)
9:       end for
10:    end if
11:    if n ∉ todoS then
12:      addNode(n, subG)
13:    end if
14:  end for
15:  for all n ∈ todoS do
16:    if ∀i ∈ adj(n) and dis(i, d) == dis(n, d) − 1 and i ∈ subG then
17:      gambitTree ⇐ buildGame(n, d)
18:      equilibria ⇐ solve(gambitTree)
19:      removeNode(n, todoS)
20:      addNode(n, subG)
21:    end if
22:  end for
23:  disValue ⇐ disValue + 1
24: until s ∈ subG
```

After that, a pure Nash equilibrium is generated by solving this sub-game (line 18) with the Gambit software tools [35], and the best strategy for the node is updated according to the equilibrium. By solving the sub-game for N3, the strategy for N3 in the equilibrium is to deliver the package to N6, as the potential detriment of delayed delivery via N4 due to attacks is greater than the comparative advantage of its shorter path. The node with the solved strategy is then removed from todoS (line 19) and added to subG (line 20). Once all the nodes at distance disValue from the destination have been iterated and the best strategies of all nodes in todoS satisfying the condition have been computed, the algorithm increments disValue by one (line 23) and continues until the starting point s is in subG (line 24).

#### **5.3 Experiment Setup & Results**

We demonstrate how our Bayesian game approach, combined with the proposed dynamic programming algorithm, produces adaptation decisions about how each node forwards packages in the routing example. As in the experiment on the Znn website, we statically analyzed a discretized region of the state space representing different attack scenarios (i.e., the malicious probabilities of N2 and N4). The network structure used in the experiment is the one shown in Figure 4. In addition, we adopted a greedy algorithm for this routing application as the benchmark and compared the system utility of the two approaches to assess the benefit of game theory under security attacks. The experiment over the whole state space with the Bayesian approach takes less than one minute, and the solution generation time for each state is negligible.

Fig. 6: Results for interdomain route example: (a) Expected route in equilibrium; (b) System utility with game theory approach; (c) Delta between system utility from game theory approach and utility from greedy algorithm.

Figure 6 (a) presents the results of the strategy selection (i.e., the expected package sequence) over two dimensions that correspond to the malicious probabilities of N2 and N4, respectively. Red triangle points denote that the strategy for N1 is to forward to N2; they cover the range of Probability N2 of around [0, 0.50]. This is because when the chance of N2 being under attack is less than 0.50, N1 should pass the package to N2, since N2 is on the shortest path to the destination; otherwise, N1 delivers the package to N3. Similarly, when the malicious probability of N4 is less than 0.35, the equilibrium strategy for N3 is to deliver the package to N4 (i.e., blue square points), since the benefit of a short delivery time outweighs the potential detriment. For the remaining situations, denoted by the black circle points, N1 passes the package to N3, which in turn forwards it to N6.
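The N4 threshold can be approximated with a back-of-the-envelope calculation. The sketch below rests on two assumptions that are not taken from the game definition itself: the system utility is 10 minus the number of hops (which agrees with the values 8, 7, 6 and 4 quoted in the text for 2, 3, 4 and 6 hops), and N3 maximizes the expected system utility directly. The actual equilibrium is computed from Shapley-allocated component payoffs, so this is only a simplified reading.

```python
def utility(hops):
    """Assumed system utility: 10 minus the hop count of the full route."""
    return 10 - hops


def n3_expected_utilities(p_n4_malicious):
    """Expected system utility of N3's two options, counting hops from N1."""
    via_n6 = utility(4)  # N1 N3 N6 N7 N5, unaffected by N4
    via_n4 = ((1 - p_n4_malicious) * utility(3)   # N1 N3 N4 N5
              + p_n4_malicious * utility(6))      # N1 N3 N4 N3 N6 N7 N5
    return {"N4": via_n4, "N6": via_n6}


def n3_best_next_hop(p_n4_malicious):
    options = n3_expected_utilities(p_n4_malicious)
    return "N4" if options["N4"] > options["N6"] else "N6"
```

Under these assumptions, forwarding to N4 yields 7 − 3p and forwarding to N6 yields 6, so the crossover lies at p = 1/3, close to the reported threshold of about 0.35.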

Figure 6 (b) describes the utility the system can obtain under the equilibrium strategies. As expected, when Probability N2 is greater than 50% and Probability N4 is greater than 35% (i.e., the black circle points in Figure 6 (a)), the utility the system can gain is 6, as there are 4 hops in the expected sequence (N1 N3 N6 N7 N5). This plot also shows that the system utility increases progressively as the probability of compromise of N2 and N4 decreases. When Probability N2 is 0, the expected utility increases to 8 (i.e., two hops in N1 N2 N5). Similarly, the utility reaches 7 when Probability N4 is 0 (i.e., three hops in N1 N3 N4 N5).

Furthermore, we adopted a greedy baseline that generates strategies for each node in a non-repeating fashion, passing the package to the adjacent node along the shortest path to the destination. The aim was to compare the utility of the two approaches when dealing with security attacks. For the network shown in Figure 4, the baseline first picks the shortest path sequence N1 N2 N5. If N2 is compromised and sends the package back, N1 redelivers it to N3 instead of N2, since the package was received from N2. The system utility for the greedy algorithm is the expected value, i.e., the weighted average of the utilities of the paths in the different attack situations. Figure 6 (c) shows the delta between the utility produced by our game theory method and the utility produced by the baseline. We can see that, under security attacks, the utility of the game theory approach is always higher than that of the greedy approach. The delta is most noticeable in the situations where N2 and N4 are highly likely to be compromised (i.e., Probability N2 and Probability N4 close to 1). This is because game theory approaches help the defenders trade off the gains and losses due to perceived risks.
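The expected utility of the greedy baseline can be sketched as a weighted average over the three attack situations it can encounter. The sketch assumes that the system utility is 10 minus the number of hops (consistent with the utility values quoted in the text), that the attacks on N2 and N4 succeed independently, and that after a bounce from N2 the baseline continues greedily via N3 and N4, falling back to N6 only if N4 also returns the package. The function names are illustrative.

```python
def utility(hops):
    """Assumed system utility: 10 minus the hop count of the full route."""
    return 10 - hops


def greedy_expected_utility(p_n2, p_n4):
    """Expected utility of the shortest-path baseline.

    p_n2, p_n4: probabilities that N2 and N4 are malicious (assumed
    independent).
    """
    n2_normal = utility(2)          # N1 N2 N5
    n2_bad_n4_normal = utility(5)   # N1 N2 N1 N3 N4 N5
    both_bad = utility(8)           # N1 N2 N1 N3 N4 N3 N6 N7 N5
    return ((1 - p_n2) * n2_normal
            + p_n2 * (1 - p_n4) * n2_bad_n4_normal
            + p_n2 * p_n4 * both_bad)


# Utility of always taking the attack-free route N1 N3 N6 N7 N5.
safe_route_utility = utility(4)

# Gap between the safe route and the baseline when both nodes are malicious.
worst_case_delta = safe_route_utility - greedy_expected_utility(1.0, 1.0)
```

Under these assumptions, the baseline drops to a utility of 2 when both nodes are certainly malicious, whereas the route through N6 still yields 6, which illustrates why the delta grows as both probabilities approach 1.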

In summary, the preliminary results of our experiment indicate that our component-level game theory approach is applicable to self-adaptive applications. To adopt our approach, attack information, such as the various attack types with their probabilities and payoffs, must be provided by the *Analyzer* in order to construct a Bayesian game based on the system's architectural structure. The results have also shown that game theory can enhance the performance of the system, especially when a potential attack is more likely to happen. In these situations, game theory approaches can help the defenders balance perceived risks by using underlying incentive mechanisms and determine, using proven mathematics, the best response as the adaptation to be executed on the network. Besides, our proposed dynamic programming algorithm is specific to this kind of application and optimizes the game solving. Another potential application is the multi-agent path finding (MAPF) problem, where a spatial position in a path can be viewed as a node in the network [39,3]. Other optimization techniques might be adopted or customized for different applications with complicated game structures.

#### **6 Related Work**

Self-adaptive systems under security attacks need to make adaptation decisions as a response to a detected threat or to deviations from security goals and requirements [18]. Lorenzoli et al. [34] proposed a technique that observes values at relevant program points and identifies the execution contexts leading to a software failure, so that mechanisms can be enabled for preventing future occurrences of failures of the same type. Bailey et al. [4] generated Role Based Access Control (RBAC) models to provide assurances for adaptations against insider threats. The RBAC technique was also applied to cloud computing environments to provide appropriate security services according to the security level and dynamic changes of the common resources [44]. Tsigkanos et al. [41] explored the use of Bigraphical Reactive Systems to perform speculative threat analysis through model checking. Burmester et al. [5] described a threat model to incorporate typical characteristics of systems, such as survivability to abnormal behavior and the possibility to recover after critically vulnerable states are reached. Dimkov et al. [14] discussed insider threats that span physical, cyber and social domains and presented a framework, Portunes, integrating all three security domains to describe attacks. Nashif et al. [2] presented a multi-level intrusion detection system that detects network attacks at three levels of granularity and proactively protects against them by employing a fusion decision algorithm. Although there are many different ways of dealing with security attacks in self-adaptive systems, it is notable that the application of game theory, with its ability to model the adversarial nature of security attacks and to design reliable defenses with proven mathematics, has not gained the attention it deserves.

Different sorts of games have been employed to study the actions of the defender and attacker. Dijk et al. [42] presented a two-player game that reasons about security scenarios where an attacker with uncertainty about its actions may periodically gain full control of an asset, with each side trying to maintain control as much as possible. An extension by Farhang et al. [19] explicitly modeled the information gains for the attackers as they control assets, improving the attacker's capability. Based on these works, Kinneer et al. [28] additionally considered multiple attacker types with different goals and capabilities by means of a Bayesian game. Instead of modeling the attackers as independent players, our work models the attacks on the component level, focusing on the defender modeling at the architecture level and possible deviations of component behaviors. Cámara et al. [6,8] adopted a game-theoretic perspective and modeled the system as turn-based stochastic multi-player games in which players can either cooperate to achieve the same goal or compete to achieve their own goals. In addition, Glazier et al. [23] used a game-based approach to automatically reason about and synthesize strategies for a meta-manager by explicitly considering alternate potential future states, thus improving the performance of a collection of autonomic systems against a defined quality objective. Some of these existing works are concerned with competitive behaviors in a system when some components cannot be controlled and even behave according to goals that conflict with those of other components in the system. However, none of them, to the best of our knowledge, proposed to model the Bayesian game at the architecture/component level, or to capture multiple attacks as a component's variant types together with the uncertainty due to unsuccessful compromise.

Game theory is also increasingly applied to network security. Frigault et al. [20] measured network security in a dynamic environment with a dynamic Bayesian network-based model to incorporate temporal factors. Charles et al. [26] developed a packet forwarding game model under imperfect private monitoring. Their equilibria rely on the probability of cooperation after observing a defection, similar to our routing games in the evaluation. However, they looked at this problem from the perspective of network nodes, without considering the situation of being attacked and how to allocate rewards from the system utility for multiple components from the architecture perspective, as illustrated in this work.

#### **7 Conclusion and Future Work**

In this paper, we have proposed a new framework for self-adaptive systems by adopting Bayesian game theory and modeled the system under security attacks as a multi-player game. An optimal adaptation strategy for responding to attacks is generated by computing the equilibrium of the game. One limitation is that we validated our approach on a simulated rather than an actual system, and we plan to further evaluate the applicability and scalability of the approach using case studies involving real systems. A second limitation is the simplification of the amount of uncertainty, such as restricting the number of component types under attack and assuming zero-sum payoffs, whereas the real-world security landscape might be more complex. Rather, we attempted to convey the idea of transforming the system architecture consisting of multiple components under attack into a Bayesian game. Since the equilibrium is sensitive to the probability distribution over types (i.e., malicious probability), sensitivity analysis is useful when the probability cannot be determined with precision but lies within a known range. In addition, modeling attacks at the component level, though easier to monitor and handle, cannot depict attacks by highly motivated and capable adversaries willing to devote significant time and continuous effort to achieve their malicious goals, known as advanced persistent threats (APTs) [28].

Moreover, we adopt a pure equilibrium as the adaptation response. However, in practice, there will likely be multiple equilibria and no guarantee of uniqueness. While this is an area for future work, one possible way to overcome this is to choose the equilibrium with the highest utility for the system. Another limitation, and a topic for future work, is that we do not consider mixed equilibria, another common solution concept in game theory. Their interpretation in terms of system behavior can vary and allows the generation of different types of defense strategies for the system, which can be explored for different applications. For example, if the mixed strategy for N1 in the routing game is choosing N2 and N3 in a 50%/50% split, as shown in Figure 4, we can consider that N1 may equally distribute its packages to N2 and N3 if multiple packages exist, or deliver its packages to N3 this time and to N2 next time. Also, the Bayesian games for these two examples were created manually, by following the framework, in the input language of the Gambit tool, which was used to solve for the equilibrium. In the future, we plan to construct the game in an automated way by supporting an architecture description interchange language, such as Acme [22].

#### **Acknowledgements**

The research is partially supported by the National Natural Science Foundation of China under Grant Nos. 61620106007 and 61751210, award N00014172899 from the Office of Naval Research and the NSA under Award No. H9823018D0008.

# **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

#### An Abstract Contract Theory for Programs with Procedures<sup>*</sup>

Christian Lidström(✉) and Dilian Gurov

KTH Royal Institute of Technology, Stockholm, Sweden {clid,dilian}@kth.se

Abstract. When developing complex software and systems, contracts provide a means for controlling the complexity by dividing the responsibilities among the components of the system in a hierarchical fashion. In specific application areas, dedicated contract theories formalise the notion of contract and the operations on contracts in a manner that best supports the development of systems in that area. At the other end, contract meta-theories attempt to provide a systematic view on the various contract theories by axiomatising their desired properties. However, there exists a noticeable gap between the most well-known contract meta-theory of Benveniste et al. [5], which focuses on the design of embedded and cyber-physical systems, and the established way of using contracts when developing general software, following Meyer's design-by-contract methodology [18]. At the core of this gap appears to be the notion of procedure: while it is a central unit of composition in software development, the meta-theory does not suggest an obvious way of treating procedures as components.

In this paper, we provide a first step towards a contract theory that takes procedures as the basic building block, and is at the same time an instantiation of the meta-theory. To this end, we propose an abstract contract theory for sequential programming languages with procedures, based on denotational semantics. We show that, on the one hand, the specification of contracts of procedures in Hoare logic, and their procedure-modular verification, can be cast naturally in the framework of our abstract contract theory. On the other hand, we also show our contract theory to fulfil the axioms of the meta-theory. In this way, we give further evidence for the utility of the meta-theory, and prepare the ground for combining our instantiation with other, already existing instantiations.

# 1 Introduction

*Contracts.* Loosely speaking, a *contract* for a software or system component is a means of specifying that the component obliges itself to guarantee a certain behaviour or result, provided that the user (or client) of the component obliges itself to fulfil certain constraints on how it interacts with the component.

<sup>*</sup> This work has been funded by the Swedish Governmental Agency for Innovation Systems (VINNOVA) under the AVerT project 2018-02727.

© The Author(s) 2021

E. Guerra and M. Stoelinga (Eds.): FASE 2021, LNCS 12649, pp. 152–171, 2021. https://doi.org/10.1007/978-3-030-71500-7\_8

One of the earliest inspirations for the notion of software contracts came from the works of Floyd [10] and Hoare [15]. One outcome of this was *Hoare logic*, which is a way of assigning meaning to sequential programs *axiomatically*, through so-called Hoare triples. A Hoare triple {P}S{Q} consists of two assertions P and Q over the program variables, called the pre-condition and post-condition, respectively, and a program S. The triple states that if the pre-condition P holds prior to executing S, then, if execution of S terminates, the post-condition Q will hold upon termination. With the help of additional, so-called *logical variables*, one can specify, with a Hoare triple, the desired relationship between the final values of certain variables (such as the return value of a procedure) and the initial values of certain other variables (such as the formal parameters of the procedure).
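The partial-correctness reading of a Hoare triple, and the role of a logical variable, can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the finite range of values, the two variables `n` and `r`, the helper `holds`, and the convention that a diverging program is reported as `None`. It checks a triple by brute force over a finite set of states, which is only a test on that set and not a proof in Hoare logic.

```python
from itertools import product

# Assumed finite model: states assign small integers to the variables n and r.
VALUES = range(-2, 4)
STATES = [{"n": n, "r": r} for n, r in product(VALUES, VALUES)]


def holds(pre, program, post, logical_values=(None,)):
    """Check the triple {pre} program {post} under partial correctness.

    pre and post take a state and the value of one logical variable.
    program maps a state to its final state, or to None if it diverges.
    """
    for n0 in logical_values:
        for s in STATES:
            if not pre(s, n0):
                continue
            final = program(dict(s))  # copy, so the initial state is untouched
            if final is not None and not post(final, n0):
                return False
    return True


def increment(s):
    """r := n + 1"""
    s["r"] = s["n"] + 1
    return s


def count_down(s):
    """while n != 0 do n := n - 1 (diverges for negative n)"""
    if s["n"] < 0:
        return None
    while s["n"] != 0:
        s["n"] -= 1
    return s


# {n = n0} r := n + 1 {r = n0 + 1}: the logical variable n0 freezes the initial n.
triple_ok = holds(lambda s, n0: s["n"] == n0, increment,
                  lambda s, n0: s["r"] == n0 + 1, logical_values=VALUES)

# {n = n0} r := n + 1 {r = n0} does not hold.
triple_bad = holds(lambda s, n0: s["n"] == n0, increment,
                   lambda s, n0: s["r"] == n0, logical_values=VALUES)

# {true} while n != 0 do n := n - 1 {n = 0} holds: divergent runs are ignored.
triple_loop = holds(lambda s, n0: True, count_down, lambda s, n0: s["n"] == 0)
```

The last triple holds even though the loop diverges for negative `n`, because the post-condition is only required of terminating executions.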

This style of specifying contracts has been advocated by Meyer [18], together with the design methodology Design-by-Contract. A central characteristic of this methodology is that it is well-suited for *independent implementation and verification*, where software components are developed independently from each other, based solely on the contracts, and without any knowledge of the implementation details of the other components.

*Contract Theories.* Since then, many other contract theories have emerged, such as Rely/Guarantee reasoning [16,22] and a number of Assume/Guarantee contract theories [4,6]. A contract theory typically formalises the notion of contract, and develops a number of operations on contracts that support typical design steps. This in turn has led to a few developments of contract *meta-theories* (e.g. [5,2,8]), which aim at unifying these, in many cases incompatible, contract theories. The most comprehensive and well-known of these is presented in Benveniste et al. [5], and is concerned specifically with the design of cyber-physical systems. Here, all properties are derived from a most abstract notion of a contract. The meta-theory focuses on the notion of contract *refinement*, and the operations of contract *conjunction* and *composition*. The intention behind refinement and composition is to support a top-down design flow, where contracts are decomposed iteratively into sub-contracts; the task is then to show that the composition of the sub-contracts refines the original contract. These operations are meant to enable *independent development* and *reuse* of components. In addition, the operation of conjunction is intended to allow the superimposition of contracts over *the same* component, when they concern different aspects of its behaviour. This also enables *component reuse*, by allowing contracts to reveal only the behaviour relevant to the different use cases.

*Motivation and Contribution.* The meta-theory of Benveniste et al. focuses on the design of embedded and cyber-physical systems. However, there exists a noticeable gap between this meta-theory and the way contracts are used when developing general software following Meyer's design-by-contract methodology. At the core of this gap appears to be the notion of *procedure*<sup>1</sup>. While the procedure is a central unit of composition in software development, the meta-theory does not suggest an obvious way of treating procedures as components. This situation is not fully satisfactory, since the software components of most embedded systems are implemented with the help of procedures (a typical C-module, for instance, would consist of a main function and a number of *helper* functions), and their development should ideally follow the same design flow as that of the embedded system as a whole.

<sup>1</sup> We use the term "procedure", rather than "function" or "method", to refer to the well-known control abstraction mechanism of imperative programming languages.

In this paper we provide a first step towards a contract theory that takes procedures as the basic building block, and at the same time respects the axioms of the meta-theory. Our contract theory is abstract, so that it can be instantiated to any procedural language, and, similarly to the meta-theory, is presented at the semantics level only. Then, in the context of a simplistic imperative programming language with procedures and its denotational semantics, we show that the specification of contracts of procedures in *Hoare logic*, and their procedure-modular verification, can be cast in the framework of our abstract contract theory. We also show that our contract theory is an instance of the meta-theory of Benveniste et al. With this we expect to contribute to the bridging of the gap mentioned above, and to give a formal justification of the design methodology supported by the meta-theory, when applied to the software components of embedded systems. Several existing contract theories have already been shown to instantiate the meta-theory. In providing a contract theory for procedural programs that also instantiates it, we increase the value of the meta-theory by providing further evidence for its universality. In addition, we prepare the theoretical ground for combining our instantiation with other instantiations, which may target components not to be implemented in software.

Our theoretical development should be seen as a proof-of-concept. In future work it will need to be extended to cover more programming language features, such as object orientation, multi-threading, and exceptions.

*Related Work.* Software contracts and operations on contracts have long been an area of intensive research, as evidenced, e.g., by [1]. We briefly mention some works related to our theory, in addition to the already mentioned ones.

Reasoning from multiple Hoare triples is studied in [21], in the context of unavailable source code, where new properties cannot be derived by re-verification. In particular, it is found that two Hoare-style rules, the standard rule of consequence and a generalised normalisation rule, are sufficient to infer, from a set of existing contracts for a procedure, any contract that is semantically entailed.

Often-changing source code is a problem for contract-based reasoning and contract reuse. In [13], abstract method calls are introduced to alleviate this problem. Fully abstract contracts are then introduced in [7], allowing reasoning about software to be decoupled from contract applicability checks, in a way that not all verification effort is invalidated by changes in a specification.

The relation between behavioural specifications and assume/guarantee-style contracts for modal transition systems is studied in [2], which shows how to build a contract framework from any specification theory supporting composition and refinement. This work is built on in [9], where a formal contract framework based on temporal logic is presented, allowing verification of correctness of contract refinement relative to a specific decomposition.

A survey of behavioural specification languages [14] found that existing languages are well-suited for expressing properties of software components, but it is a challenge to express how components interact, making it difficult to reason about system and architectural level properties from detailed design specifications. This provides additional evidence for the gap between contracts used in software verification and contracts as used in system design.

*Structure.* The paper is organised as follows. Section 2 recalls the concept of contract based design and the contract meta-theory considered in the present paper. In Section 3 we present a denotational semantics for programs with procedures, including a semantics for contracts for use in procedure-modular verification. Next, Section 4 presents our abstract contract theory for sequential programs with procedures. Then, we show in Section 5 that our contract theory fulfils the axioms of the meta-theory, while in Section 6 we show how the specification of contracts of procedures in Hoare logic and their procedure-modular verification can be cast in the framework of our abstract contract theory. We conclude with Section 7.

#### 2 Contract Based Design

This section describes the concept of *contract based design*, and motivates its use in cyber-physical systems development. We then recall the contract meta-theory by Benveniste et al. [5].

#### 2.1 Contract Based Design of Cyber-Physical Systems

*Contract based design* is an approach to systems design, where the system is developed in a top-down manner through the use of contracts for components, which are incrementally assembled so that they preserve the desired system-wide properties. Contracts are typically described by a set of *assumptions* the component makes on its environment, and a set of *guarantees* on the component's behaviour, given that it operates in an environment adhering to the assumptions [5].

Present-day cyber-physical systems, such as those found in the automotive, avionics and other industries, are extremely complex. Products assembled by Original Equipment Manufacturers (OEMs) often consist of components from a number of different suppliers, all using their own specialised design processes, system architectures, development platforms, and tools. This is also true inside the OEMs, where there are different teams with different viewpoints of the system, and their own design processes and tools. In addition, the system itself has several different aspects that need to be managed, such as the architecture, safety and security requirements, functional behaviour, and so on. Thus, a rigorous design framework is called for that can solve these design-chain management issues.

Contract based design addresses these challenges through the principles, at the specification level, of *refinement* and *abstraction*, which are processes for managing the design flow between different layers of abstraction, and *composition* and *decomposition*, which manage the flow at the same level of abstraction. Generally, when designing a system, at the top level of abstraction there will be an overall system specification (or contract). This *top-level contract* is then refined, to provide a more concrete contract for the system, and decomposed, in order to obtain contracts for the sub-systems, and to separate the different viewpoints of the system. A system design typically iterates the decomposition-and-refinement process, resulting in several layers of abstraction, until contracts are obtained that can be directly implemented, or for which implementations already exist. An important requirement on this methodology of hierarchical decomposition and refinement of contracts is that it must guarantee that when the low-level components implement their concrete contracts, and are combined to form the overall system, then the top-level, abstract, contract shall hold.

Furthermore, a contract framework in particular needs to support *independent development* and *component reuse*. That is, specifications for components, and their operations, must allow for components and specifications to be independently designed and implemented, and to be used in different parts of the system, each with their own assumptions on how the other components (i.e., the environment) behave. This is achieved through the principal operations on contracts: *refinement*, *composition*, and *conjunction*.

Refinement allows one to extract a contract at the appropriate level of abstraction. A desired property of refinement is that components which have been designed with reference to the more abstract (i.e., weaker) contract do not need to be re-designed after the refinement step. That is, in the early stages of development an OEM may have provided a weak contract for some subsystem to an external supplier, which implemented a component relying on this contract. As development of the system progresses, and the contract is refined, the component supplied externally should still operate according to its guarantees without needing to be changed, when instead assuming the new, refined, contract.

Composition enables one to combine contracts of different components into a contract for the larger subsystem obtained when combining the components. Again, a desirable property is that other components relying on one or more of the individual contracts can, after composition of the contracts, assume the new contract and still fulfil their guarantees without being re-designed, thus ensuring that subsystems can be independently implemented.

Finally, contract conjunction is another way of combining contracts, but now for the different viewpoints of a single component. This allows one to separate a contract into several different, finer contracts for the same component, revealing just enough information for each particular system that depends on it, so that it can be reused in different parts of the system, or in entirely different systems.

#### 2.2 A Contract Meta-Theory

We consider the meta-theory described in [5]. The stated purpose of the meta-theory has been to distil the notion of a contract to its essence, so that it can be used in system design methodologies without ambiguities. In particular, the meta-theory has been developed to give support for design-chain management, and to allow *component reuse* and *independent development*. It has been shown that a number of concrete contract theories instantiate it, including assume/guarantee-contracts, synchronous Moore interfaces, and interface theories. To our knowledge, this is the only meta-theory of its purpose and scope.

We now present the formal definitions of the concepts defined in the meta-theory, and the properties that they entail. The meta-theory is defined only in terms of semantics, and it is up to particular concrete instantiations to provide a syntax.

*Components.* The most basic concept in the meta-theory is that of a *component*, which represents any concrete part of the system. Thus, we have an abstract component universe $\mathbb{M}$ with components $m \in \mathbb{M}$. Over pairs of components, we have a *composition* operation $\times$. This operation is partially defined, and two components $m_1$ and $m_2$ are called *composable* when $m_1 \times m_2$ is defined. In such cases, we call $m_1$ an *environment* for $m_2$, and vice versa. In addition, component composition must be both commutative and associative, in order to ensure that different components can be combined in any order.

Typically, components are *open*, in the sense that they contain functionality provided by other components, i.e., their environment. The environment in which a component is to be placed is often unknown at development time, and although a component cannot restrict it, it is designed for a certain context.

*Contracts.* In the meta-theory, the notion of *contract* is defined in terms of sets of components. The contract universe $\mathbb{C} \stackrel{\text{def}}{=} 2^{\mathbb{M}} \times 2^{\mathbb{M}}$ consists of contracts $C = (E, M)$, where $E$ and $M$ are the sets of *environments* and *implementations* of $C$, respectively. Importantly, each pair $(m_1, m_2) \in E \times M$ must be composable. This definition is intentionally abstract. The intuition is that contracts separate the responsibilities of a component from the expectations on its environment. Moreover, contracts are best seen as *weak specifications* of components: they should expose just enough information to be adequate for their purpose.

For a component $m$ and a contract $C = (E, M)$, we shall sometimes write $m \models^E C$ for $m \in E$, and $m \models^M C$ for $m \in M$. A contract $C$ is said to be *consistent* if it has at least one implementation, and *compatible* if it has at least one environment.

*Contract refinement.* For two contracts $C_1 = (E_1, M_1)$ and $C_2 = (E_2, M_2)$, $C_1$ is said to *refine* $C_2$, denoted $C_1 \preceq C_2$, iff $M_1 \subseteq M_2$ and $E_2 \subseteq E_1$. As an axiom of the meta-theory, it is required that the greatest lower bound with respect to refinement exists, for all subsets of the contract universe $\mathbb{C}$. Table 1 summarises the important properties of refinement and the other operations on contracts that a concrete contract theory needs to possess in order to be considered an instance of the meta-theory.

Table 1. Properties that hold in theories that adhere to the meta-theory.

*Contract conjunction.* The *conjunction* of two contracts $C_1$ and $C_2$, denoted $C_1 \wedge C_2$, is defined as their greatest lower bound w.r.t. the refinement order. (The intention is that $(E_1, M_1) \wedge (E_2, M_2)$ should equal $(E_1 \cup E_2, M_1 \cap M_2)$; however, this cannot be taken as the definition since not every such pair necessarily constitutes a contract.) Then, we have the three desirable properties of conjunction listed in Table 1, which together are referred to as *shared refinement*.
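The set-based definitions of contract, refinement and conjunction can be made concrete with a short Python sketch. It is a minimal illustration under stated assumptions and not the paper's formalisation: components are represented by opaque strings, the names `Contract`, `refines` and `conjunction_candidate` are hypothetical, and the requirement that every environment/implementation pair be composable is not modelled, so the candidate conjunction is only the intended pair $(E_1 \cup E_2, M_1 \cap M_2)$.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contract:
    environments: frozenset     # the set E
    implementations: frozenset  # the set M

    def is_consistent(self):
        return len(self.implementations) > 0

    def is_compatible(self):
        return len(self.environments) > 0


def refines(c1, c2):
    """c1 refines c2 iff M1 is a subset of M2 and E2 is a subset of E1."""
    return (c1.implementations <= c2.implementations
            and c2.environments <= c1.environments)


def conjunction_candidate(c1, c2):
    """The intended conjunction (E1 union E2, M1 intersected with M2)."""
    return Contract(c1.environments | c2.environments,
                    c1.implementations & c2.implementations)


c1 = Contract(frozenset({"e1"}), frozenset({"m1", "m2"}))
c2 = Contract(frozenset({"e2"}), frozenset({"m2", "m3"}))
both = conjunction_candidate(c1, c2)

# Any other contract refining both c1 and c2 also refines the candidate.
c3 = Contract(frozenset({"e1", "e2", "e3"}), frozenset({"m2"}))
```

In this example `both` refines `c1` and `c2`, while `c1` does not refine `c2`; `c3` shows the greatest-lower-bound behaviour on one instance.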

*Contract composition.* The *composition* of two contracts $C_1 = (E_1, M_1)$ and $C_2 = (E_2, M_2)$, denoted $C_1 \otimes C_2 = (E, M)$, is defined when every two components $m_1 \in M_1$ and $m_2 \in M_2$ are composable, and must then be the least contract, w.r.t. the refinement order, satisfying the following conditions:

(i) $m_1 \in M_1 \land m_2 \in M_2 \Rightarrow m_1 \times m_2 \in M$; (ii) $e \in E \land m_1 \in M_1 \Rightarrow m_1 \times e \in E_2$; and (iii) $e \in E \land m_2 \in M_2 \Rightarrow e \times m_2 \in E_1$.

If all of the above is satisfied, then properties 3-6 of Table 1 hold. The intention is that composing two components implementing $C_1$ and $C_2$ should yield an implementation of $C_1 \otimes C_2$, and composing an environment of $C_1 \otimes C_2$ with an implementation of $C_1$ should result in a valid environment for $C_2$, and vice versa. This is important in order to enable independent development.
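Conditions (i) to (iii) can be checked mechanically on a toy component model, as in the following Python sketch. The model is an assumption made purely for illustration: a component is a set of part names, and two components are composable exactly when their sets are disjoint, in which case their composition is the union (a partial, commutative and associative operation). Condition (iii) is read symmetrically to (ii), i.e. as landing in $E_1$, which is what the "vice versa" in the text suggests. The helper names are hypothetical, and the sketch does not check that the candidate is the least such contract.

```python
def compose(m1, m2):
    """Partial composition: defined (as the union) only for disjoint components."""
    return m1 | m2 if m1.isdisjoint(m2) else None


def composition_conditions_hold(composed, c1, c2):
    """Check conditions (i)-(iii) for a candidate composed = (E, M)."""
    (E, M), (E1, M1), (E2, M2) = composed, c1, c2
    cond_i = all(compose(m1, m2) in M for m1 in M1 for m2 in M2)
    cond_ii = all(compose(m1, e) in E2 for e in E for m1 in M1)
    cond_iii = all(compose(e, m2) in E1 for e in E for m2 in M2)
    return cond_i and cond_ii and cond_iii


def part(*names):
    return frozenset(names)


c1 = ({part("b", "c")}, {part("a")})    # a expects to run next to b and c
c2 = ({part("a", "c")}, {part("b")})    # b expects to run next to a and c
good = ({part("c")}, {part("a", "b")})  # a x b implements, c is the environment
bad = ({part("d")}, {part("a", "b")})   # d is not an environment c1 and c2 expect

result_good = composition_conditions_hold(good, c1, c2)
result_bad = composition_conditions_hold(bad, c1, c2)
```

Here `good` passes all three conditions, while `bad` fails condition (ii), since composing `a` with `d` does not yield an environment of `c2`.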

# 3 Denotational Semantics of Programs and Contracts

In this section we summarise the background needed to understand the formal developments later in the paper. First, we recall the standard denotational semantics of programs with procedures on a typical toy programming language. Next, we summarise Hoare logic and contracts, and provide a semantic justification of procedure-modular verification, also based on denotational semantics.

#### 3.1 The Denotational Semantics of Programs with Procedures

This section sketches the standard presentation of denotational semantics for procedural languages, as presented in textbooks such as [23,19]. This semantics is the inspiration for the definition of components in our abstract contract theory in Section 4.1. We start with a simplistic programming language not involving procedures, and add procedures later to the language.

The following toy sequential programming language is typically used to present the denotational semantics of imperative languages:

$$S ::= \mathsf{skip} \mid x := a \mid S_1; S_2 \mid \mathsf{if}\ b\ \mathsf{then}\ S_1\ \mathsf{else}\ S_2 \mid \mathsf{while}\ b\ \mathsf{do}\ S_1$$

where S ranges over statements, a over arithmetic expressions, and b over Boolean expressions.

To define the denotational semantics of the language, we define the set **State** of program states. A state s ∈ **State** is a mapping from the program variables to, for simplicity, the set of integers.

The *denotation* of a statement $S$, denoted $[\![S]\!]$, is typically given as a partial function $\mathbf{State} \to \mathbf{State}$ such that $[\![S]\!](s) = s'$ whenever executing statement $S$ from the initial state $s$ terminates in state $s'$. In case that executing $S$ from $s$ does not terminate, the value of $[\![S]\!](s)$ is undefined. The definition of $[\![S]\!]$ proceeds by induction on the structure of $S$. For example, the meaning of sequential composition of statements is usually captured with relation composition, as given by the equation $[\![S_1; S_2]\!] \stackrel{\text{def}}{=} [\![S_1]\!] \circ [\![S_2]\!]$. For the treatment of the remaining statements of the language, the reader is referred to [23,19].

The definition of denotation captures through its type (as a partial function) that the execution of statements is deterministic. For non-deterministic programs, the type of denotations is relaxed to $[\![S]\!] \subseteq \mathbf{State} \times \mathbf{State}$; then, $(s, s') \in [\![S]\!]$ captures that there is an execution of $S$ starting in $s$ that terminates in $s'$. For technical reasons that will become clear below, we shall use this latter denotation type in our treatment.
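The relational denotation type can be sketched as a small Python interpreter that computes the denotation of a statement of the toy language as a set of (initial, final) state pairs. Several things here are assumptions made for illustration and are not part of the paper: the tuple-based abstract syntax, the restriction to two variables `n` and `r` over a small finite range (assignments leaving the range are dropped, which truncates the semantics), and the treatment of `while` by fixed-point iteration, which is the standard textbook approach but is not spelled out in the text above.

```python
from itertools import product

VARS = ("n", "r")
VALUES = range(0, 4)  # assumed finite value range, to keep the relations finite
STATES = list(product(VALUES, repeat=len(VARS)))  # a state is a tuple of values


def as_dict(state):
    return dict(zip(VARS, state))


def seq(r1, r2):
    """Relational composition: run r1 first, then r2."""
    return {(s, u) for (s, t) in r1 for (t2, u) in r2 if t == t2}


def denote(stmt):
    """Denotation of a statement as a set of (initial, final) state pairs."""
    kind = stmt[0]
    if kind == "skip":
        return {(s, s) for s in STATES}
    if kind == "assign":
        _, var, expr = stmt
        rel = set()
        for s in STATES:
            env = as_dict(s)
            env[var] = expr(as_dict(s))
            if env[var] in VALUES:  # drop steps that leave the finite range
                rel.add((s, tuple(env[v] for v in VARS)))
        return rel
    if kind == "seq":
        return seq(denote(stmt[1]), denote(stmt[2]))
    if kind == "if":
        _, cond, then_branch, else_branch = stmt
        then_rel, else_rel = denote(then_branch), denote(else_branch)
        return ({(s, t) for (s, t) in then_rel if cond(as_dict(s))}
                | {(s, t) for (s, t) in else_rel if not cond(as_dict(s))})
    if kind == "while":
        _, cond, body = stmt
        step = {(s, t) for (s, t) in denote(body) if cond(as_dict(s))}
        exits = {(s, s) for s in STATES if not cond(as_dict(s))}
        result = set()
        while True:  # iterate from the empty relation to the least fixed point
            bigger = exits | seq(step, result)
            if bigger == result:
                return result
            result = bigger
    raise ValueError(f"unknown statement: {kind}")


# r := 0; while n != 0 do (n := n - 1; r := r + 1)
body = ("seq", ("assign", "n", lambda s: s["n"] - 1),
               ("assign", "r", lambda s: s["r"] + 1))
program = ("seq", ("assign", "r", lambda s: 0),
                  ("while", lambda s: s["n"] != 0, body))
meaning = denote(program)

# while true do skip: no pair at all, i.e. undefined on every initial state
diverging = denote(("while", lambda s: True, ("skip",)))
```

For this deterministic program, `meaning` contains exactly one pair per initial state, for instance the pair relating the state with `n = 2, r = 3` to the state with `n = 0, r = 2`. The diverging loop has the empty relation as its denotation.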

Note that we could alternatively have chosen $\mathbf{State}^+$ as the denotational domain, and most results would still hold in the context of finite-trace semantics. However, we chose to develop the theory with a focus on Hoare logic and deductive verification. In fact, the domain $\mathbf{State} \times \mathbf{State}$ can be seen as a special case of finite traces. In future work, we will also investigate concrete contract languages based on this semantics, and extend the theory for that context.

*Procedures and Procedure Calls.* To extend the language and its denotational semantics with procedures and procedure calls, we follow again the approach of [23], but adapt it to an "open" setting, where some called procedures might not be declared. We consider programs in the context of a finite set $\mathcal{P}$ of procedure names (of some larger, "closed" program), and a set of *procedure declarations* of the form proc $p$ is $S_p$, where $p \in \mathcal{P}$. Further, we extend the toy programming language with the statement call $p$.

Listing 1.1. An even-odd toy program.

    proc even is if n = 0 then r := 1 else (n := n − 1; call odd);
    proc odd is if n = 0 then r := 0 else (n := n − 1; call even)

As an example, Listing 1.1 shows a (closed) program in the toy language, implementing two mutually recursive procedures. The procedures check whether the value of the global variable n is even or odd, respectively, and assign the corresponding truth value to the variable r.

Due to the (potential) recursion in the procedure declarations, the denotation of call $p$, and thus of the whole language, cannot be defined by structural induction as directly as before. We therefore define, for any set $P \subseteq \mathcal{P}$ of procedure names, the set $\mathbf{Env}_P \stackrel{\text{def}}{=} P \to 2^{\mathbf{State} \times \mathbf{State}}$ of *procedure environments*, each environment $\rho \in \mathbf{Env}_P$ thus providing a denotation for each procedure in $P$.

Let $\mathbf{Env} \stackrel{\text{def}}{=} \bigcup_{P \subseteq \mathcal{P}} \mathbf{Env}_P$ be the set of all procedure environments. We define a partial order relation $\sqsubseteq$ on procedure environments as follows. For any two procedure environments $\rho \in \mathbf{Env}_P$ and $\rho' \in \mathbf{Env}_{P'}$, $\rho \sqsubseteq \rho'$ if and only if $P \subseteq P'$ and $\forall p \in P.\ \rho(p) \subseteq \rho'(p)$.

Recall that a *complete lattice* is a partial order in which every set of elements has a greatest lower bound (*glb*) within the domain of the lattice (see, e.g., [23]). It is easy to show that for any $P \subseteq \mathcal{P}$, $(\mathbf{Env}_P, \sqsubseteq)$ is a complete lattice, since a greatest lower bound will exist within $\mathbf{Env}_P$. Then, the least upper bound (*lub*) $\rho_1 \sqcup \rho_2$ of any two procedure environments $\rho_1 \in \mathbf{Env}_{P_1}$ and $\rho_2 \in \mathbf{Env}_{P_2}$ also exists, and is the environment $\rho \in \mathbf{Env}_{P_1 \cup P_2}$ such that $\forall p \in P_1 \cup P_2.\ \rho(p) = \rho_1(p) \cup \rho_2(p)$ (reading $\rho_i(p)$ as $\emptyset$ when $p \notin P_i$).
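The order and the bounds on environments can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' formalisation: an environment is a Python `dict` from procedure names to relations, each relation is a `frozenset` of pairs, and states are opaque values. The names `leq`, `lub` and `glb` are hypothetical.

```python
# Assumption: an environment is a dict from procedure names to relations,
# each relation being a frozenset of (state, state) pairs.

def leq(rho1, rho2):
    """rho1 is below rho2: smaller domain and pointwise inclusion."""
    return set(rho1) <= set(rho2) and all(rho1[p] <= rho2[p] for p in rho1)

def lub(rho1, rho2):
    """Least upper bound: union of domains, pointwise union of relations.
    A procedure missing from one environment contributes the empty relation."""
    return {p: rho1.get(p, frozenset()) | rho2.get(p, frozenset())
            for p in set(rho1) | set(rho2)}

def glb(rho1, rho2):
    """Greatest lower bound: common domain, pointwise intersection."""
    return {p: rho1[p] & rho2[p] for p in set(rho1) & set(rho2)}

a = {"even": frozenset({(0, 1)})}
b = {"even": frozenset({(0, 1), (2, 1)}), "odd": frozenset({(1, 0)})}
```

In this example `a` lies below `b`, so their lub is `b` and their glb is `a`.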

We will sometimes need a procedure environment that maps every procedure in $P$ to $\mathbf{State} \times \mathbf{State}$, and we shall denote this environment by $\rho^{\top}_P$.

Next, for sets of procedures, we shall need the notion of *interface*, which is a pair $(P^-, P^+)$ of disjoint sets of procedure names, where $P^+ \subseteq \mathcal{P}$ is a set of *provided* (or declared) procedures, and $P^- \subseteq \mathcal{P}$ a set of *required* (or called, but not declared) ones.

Then, we (re)define the notion of denotation of statements $S$ in the context of a given interface $(P^-, P^+)$ and environments $\rho^- \in \mathbf{Env}_{P^-}$ and $\rho^+ \in \mathbf{Env}_{P^+}$, and denote it by $[\![S]\!]^{\rho^+}_{\rho^-}$. In particular, we define $[\![\text{call } p]\!]^{\rho^+}_{\rho^-}$ as $\rho^-(p)$ when $p \in P^-$ and as $\rho^+(p)$ when $p \in P^+$.

Intuitively, the denotation of a call to a procedure should be equal to the denotation of the body of the latter. We therefore introduce, given an environment $\rho^- \in \mathbf{Env}_{P^-}$, the function $\xi : \mathbf{Env}_{P^+} \to \mathbf{Env}_{P^+}$ defined by $\xi(\rho^+)(p) \stackrel{\text{def}}{=} [\![S_p]\!]^{\rho^+}_{\rho^-}$ for any $\rho^+ \in \mathbf{Env}_{P^+}$ and $p \in P^+$, and consider its fixed points. By the Knaster-Tarski Fixed-Point Theorem (as stated, e.g., in [23]), since $(\mathbf{Env}_{P^+}, \sqsubseteq)$ is a complete lattice and $\xi$ is monotonic, $\xi$ has a least fixed-point $\rho^+_0$.

Finally, we define the notion of *standard denotation* of statement $S$ in the context of a given interface $(P^-, P^+)$ and environment $\rho^- \in \mathbf{Env}_{P^-}$, denoted $[\![S]\!]_{\rho^-}$, by $[\![S]\!]_{\rho^-} \stackrel{\text{def}}{=} [\![S]\!]^{\rho^+_0}_{\rho^-}$, where $\rho^+_0$ is the least fixed-point defined above. For example, for the closed program in Listing 1.1, we have an interface with $P^+ = \{\mathit{even}, \mathit{odd}\}$ and $P^- = \emptyset$. Then, $(s, s') \in [\![S_{\mathit{even}}]\!]^{\rho^+}_{\rho^-}$ if either $s(n) = 0$ and $s' = s[r \mapsto 1]$, or else if $s(n) > 0$ and $(s[n \mapsto s(n) - 1], s') \in \rho^+(\mathit{odd})$. The denotation $[\![S_{\mathit{odd}}]\!]^{\rho^+}_{\rho^-}$ is analogous. The resulting least fixed-point $\rho^+_0$ is such that $(s, s') \in [\![S_{\mathit{even}}]\!]_{\rho^-}$, or equivalently $(s, s') \in [\![S_{\mathit{even}}]\!]^{\rho^+_0}_{\rho^-}$, whenever $s(n) \ge 0$, and either $s(n)$ is even and then $s'(n) = 0$ and $s'(r) = 1$, or else $s(n)$ is odd and then $s'(n) = 0$ and $s'(r) = 0$. The standard denotation $[\![S_{\mathit{odd}}]\!]_{\rho^-}$ of *odd* is analogous.
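The least fixed-point for the even-odd program can be computed explicitly on a bounded state space. The sketch below is an illustration only: it assumes states `(n, r)` with $0 \le n \le N$, and it uses Kleene-style iteration from the bottom environment, which reaches the least fixed-point here because the lattice is finite and the unfolding function is monotonic. The names `body`, `xi` and `least_fixed_point` are hypothetical.

```python
# Assumption: states are pairs (n, r) with 0 <= n <= N and r in {0, 1}.
N = 5
STATES = [(n, r) for n in range(N + 1) for r in (0, 1)]

def body(base_result, callee):
    """Denotation of 'if n = 0 then r := base_result else (n := n - 1; call callee)',
    as a function of the environment rho that interprets the call."""
    def denote(rho):
        rel = set()
        for (n, r) in STATES:
            if n == 0:
                rel.add(((n, r), (n, base_result)))
            else:
                # n := n - 1, then follow the callee's current denotation
                rel |= {((n, r), t) for (s, t) in rho[callee] if s == (n - 1, r)}
        return frozenset(rel)
    return denote

BODIES = {"even": body(1, "odd"), "odd": body(0, "even")}

def xi(rho):
    """One unfolding: map each procedure to the denotation of its body under rho."""
    return {p: BODIES[p](rho) for p in BODIES}

def least_fixed_point(f, bottom):
    """Iterate f from the bottom element until the environment stabilises."""
    rho = bottom
    while True:
        nxt = f(rho)
        if nxt == rho:
            return rho
        rho = nxt

rho0 = least_fixed_point(xi, {p: frozenset() for p in BODIES})
```

On this state space, `rho0["even"]` relates every start state to the state with `n = 0` and `r = 1` when the initial `n` is even and `r = 0` when it is odd, matching the description in the text.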

#### 3.2 Hoare Logic and Contracts

In this section we summarise the denotational semantics of Hoare logic and the semantic justification of procedure-modular verification, as developed by the second author in [12]. These formalisations serve as the starting point for the definition of contracts in our contract theory developed in Section 4.2.

*Hoare Logic.* The basic judgement of Hoare logic [15] is the Hoare triple, written $\{P\}S\{Q\}$, where $P$ and $Q$ are assertions over the program state, and $S$ is a program statement. The Hoare triple signifies that if the statement $S$ is executed from a state that satisfies $P$ (called the precondition), and if this execution terminates, then the final state of the execution will satisfy $Q$ (called the postcondition). Additionally, so-called *logical variables* can be used within a Hoare triple, to specify the desired relationship between the values of variables after execution and the values of variables before execution. The values of the program variables are defined by the notion of state; to give a meaning to the logical variables we shall use *interpretations* $\mathcal{I}$. We shall write $s \models_{\mathcal{I}} P$ to signify that the assertion $P$ is true w.r.t. state $s$ and interpretation $\mathcal{I}$. The formal validity of a Hoare triple is denoted by $\models_{par} \{P\}S\{Q\}$, where the subscript signifies that validity is in terms of *partial correctness*, where termination of the execution of $S$ is not required.

An example of a Hoare triple, stating the desired behaviour of procedure *odd* from Listing 1.1, is shown below, where we use the logical variable $n_0$ to capture the value of $n$ prior to execution of *odd*:

$$\{ n \ge 0 \land n = n_0 \}\ S_{\mathit{odd}}\ \{ (n_0 \bmod 2 = 0 \Rightarrow r = 0) \land (n_0 \bmod 2 = 1 \Rightarrow r = 1) \} \tag{1}$$

Procedure *even* is specified analogously.

Hoare logic comes with a proof calculus for reasoning in terms of Hoare triples, consisting of proof rules for the different types of statements of the programming language. An example is the rule for sequential composition:

$$\frac{\{P\}\ S_1\ \{R\} \qquad \{R\}\ S_2\ \{Q\}}{\{P\}\ S_1; S_2\ \{Q\}}\ \text{COMPOSITION}$$

which essentially states that if executing $S_1$ from any state satisfying $P$ terminates (if at all) in some state satisfying $R$, and executing $S_2$ from any state satisfying $R$ terminates (if at all) in some state satisfying $Q$, then it is the case that executing the composition $S_1; S_2$ from any state satisfying $P$ terminates (if at all) in some state satisfying $Q$. The proof system is sound and relatively complete w.r.t. the denotational semantics of the programming language (see, e.g., [23,19]).
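As a small worked instance of the rule (an illustration that is not taken from the original text, and that assumes the standard assignment axiom $\{Q[a/x]\}\ x := a\ \{Q\}$ and the rule of consequence), consider the statement $n := n - 1;\ r := 1$ with intermediate assertion $n = 0$:

$$\frac{\{n = 1\}\ n := n - 1\ \{n = 0\} \qquad \{n = 0\}\ r := 1\ \{r = 1\}}{\{n = 1\}\ n := n - 1;\ r := 1\ \{r = 1\}}\ \text{COMPOSITION}$$

The first premise is an instance of the assignment axiom for the postcondition $n = 0$, since $n - 1 = 0$ is equivalent to $n = 1$. The second premise follows from the axiom instance $\{1 = 1\}\ r := 1\ \{r = 1\}$ by the rule of consequence, since $n = 0$ implies $1 = 1$.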

*Hoare Logic Contracts.* One can view a Hoare triple {P}S{Q} as a *contract* C = (P, Q) imposed on the program S. In many contexts it is meaningful to separate the contract from the program; for instance, if the program is yet to be implemented. In our earlier work [12], we gave such contracts a denotational semantics as follows:

$$[\![C]\!] \stackrel{\text{def}}{=} \{ (s, s') \mid \forall \mathcal{I}.\ (s \models_{\mathcal{I}} P \Rightarrow s' \models_{\mathcal{I}} Q) \} \tag{2}$$

The rationale behind this definition is the following desirable property: a program *meets* a contract whenever its denotation is subsumed by the denotation of the contract, i.e., $S \models_{par} C$ if and only if $[\![S]\!] \subseteq [\![C]\!]$.

For example, for the contract $C_{\mathit{odd}}$ induced by (1) we have that $(s, s') \in [\![C_{\mathit{odd}}]\!]$ if and only if either $s(n) < 0$, or else $s'(r) = 0$ if $s(n)$ is even and $s'(r) = 1$ if $s(n)$ is odd. The denotation of $C_{\mathit{even}}$ is analogous.
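Definition (2) can be evaluated directly when states and interpretations are finite. The following sketch is only an illustration: it assumes states `(n, r)` with $-2 \le n \le 4$, and interpretations that assign the single logical variable `n0` a value from the same range. The names `contract_denotation`, `pre_odd`, `post_odd` and `meets` are hypothetical.

```python
# Assumptions: states are pairs (n, r) with -2 <= n <= 4 and r in {0, 1};
# an interpretation is just a value for the single logical variable n0.
NS = range(-2, 5)
STATES = [(n, r) for n in NS for r in (0, 1)]
INTERPRETATIONS = list(NS)

def contract_denotation(pre, post):
    """[[C]] = {(s, s') | for every interpretation: s satisfies P implies s' satisfies Q}."""
    return {(s, t) for s in STATES for t in STATES
            if all((not pre(s, n0)) or post(t, n0) for n0 in INTERPRETATIONS)}

def pre_odd(s, n0):
    """Precondition of contract (1): n >= 0 and n = n0."""
    return s[0] >= 0 and s[0] == n0

def post_odd(t, n0):
    """Postcondition of contract (1): r = 0 for even n0, r = 1 for odd n0."""
    return (n0 % 2 != 0 or t[1] == 0) and (n0 % 2 != 1 or t[1] == 1)

C_ODD = contract_denotation(pre_odd, post_odd)

def meets(denotation, contract):
    """S meets C (partial correctness) iff [[S]] is a subset of [[C]]."""
    return denotation <= contract
```

Enumerating `C_ODD` reproduces the characterisation given in the text: pairs that start from a negative `n` are unconstrained, and otherwise the final `r` is fixed by the parity of the initial `n`.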

*The Denotational Semantics of Programs with Procedure Contracts.* Let $S$ be a program with procedures, and let every declared procedure $p \in \mathcal{P}$ be equipped with a procedure contract $C_p$. *Procedure-modular verification* refers to techniques that verify every procedure in isolation. The key to this is to handle procedure calls by using the contract of the called procedure rather than its body (i.e., by *contracting* rather than by *inlining* [7]). In [12], a semantic justification of this is given by means of a *contract-relative* denotational semantics of statements. The intuition behind this semantics is that procedure calls are given a meaning through the denotation of the contract of the called procedure, rather than through the denotation of its body.

The contract-relative denotational semantics of a statement $S$, denoted $[\![S]\!]^{cr}$, is defined with the help of the *contract environment* $\rho_c$ that is induced by the procedure contracts, i.e., $\rho_c(p) \stackrel{\text{def}}{=} [\![C_p]\!]$ for all $p \in \mathcal{P}$, as $[\![S]\!]^{cr} \stackrel{\text{def}}{=} [\![S]\!]^{\rho_c}$. Notice that this definition does not involve solving any recursive equations (i.e., finding fixed points), and gives rise to a contract-relative notion of when a statement meets a contract, namely $S \models^{cr}_{par} C$ if and only if $[\![S]\!]^{cr} \subseteq [\![C]\!]$. This is exactly the correctness notion that is the target of procedure-modular verification. As shown in [12], this notion is *sound* w.r.t. the original notion $S \models_{par} C$, in the sense that $S \models^{cr}_{par} C$ entails $S \models_{par} C$. In other words, verifying a program procedure-modularly establishes that the program is correct w.r.t. its contract in the standard sense.

For example, the contract-relative semantics of $S_{\mathit{even}}$ is such that $(s, s') \in [\![S_{\mathit{even}}]\!]^{cr}$ if either $s(n) < 0$, or $s(n) = 0$ and $s' = s[r \mapsto 1]$, or else $s'(r) = 1$ if $s(n)$ is even and $s'(r) = 0$ if $s(n)$ is odd. The contract-relative semantics of $S_{\mathit{odd}}$ is analogous. Then, it is easy to check that both $S_{\mathit{even}} \models^{cr}_{par} C_{\mathit{even}}$ and $S_{\mathit{odd}} \models^{cr}_{par} C_{\mathit{odd}}$ hold.
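The final check can be replayed mechanically on a bounded state space. The sketch below is an illustration under stated assumptions: states are pairs `(n, r)` with $-2 \le n \le 4$, and contract denotations are built from the closed-form characterisation given earlier for $[\![C_{\mathit{odd}}]\!]$ (and its analogue for *even*) rather than from the quantified definition. The names `contract`, `RHO_C` and `body_cr` are hypothetical.

```python
# Assumption: states are pairs (n, r) with -2 <= n <= 4 and r in {0, 1}.
NS = range(-2, 5)
STATES = [(n, r) for n in NS for r in (0, 1)]

def contract(parity_to_r):
    """Closed form of a contract denotation: from n >= 0 the final r is fixed by
    the parity of the initial n; from n < 0 every final state is allowed."""
    return frozenset((s, t) for s in STATES for t in STATES
                     if s[0] < 0 or t[1] == parity_to_r[s[0] % 2])

# Contract environment: even yields r = 1 on even n, odd yields r = 1 on odd n.
RHO_C = {"even": contract({0: 1, 1: 0}), "odd": contract({0: 0, 1: 1})}

def body_cr(base_result, callee):
    """Contract-relative denotation of
    'if n = 0 then r := base_result else (n := n - 1; call callee)'.
    The call is interpreted by the callee's contract, so no fixed point is needed.
    Decrements that leave the bounded range are dropped (sketch limitation)."""
    rel = set()
    for (n, r) in STATES:
        if n == 0:
            rel.add(((n, r), (n, base_result)))
        else:
            rel |= {((n, r), t) for (s, t) in RHO_C[callee] if s == (n - 1, r)}
    return frozenset(rel)

S_EVEN_CR = body_cr(1, "odd")
S_ODD_CR = body_cr(0, "even")
```

Both subset checks, `S_EVEN_CR <= RHO_C["even"]` and `S_ODD_CR <= RHO_C["odd"]`, succeed on this state space, which is the bounded analogue of the two contract-relative judgements.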

#### 4 An Abstract Contract Theory

This section presents an abstract contract theory for programs with procedures. The theory builds on the basic notion of *denotation* as a binary relation over states. As we will show later, it is both an abstraction of the denotational semantic view on programs with procedures and procedure contracts presented in Sections 3.1 and 3.2, and an instantiation of the meta-theory described in Section 2.2.

#### 4.1 Components

In the context of a concrete programming language, we view a component as a module, consisting of a collection of procedures that are *provided* by the module. The module may call *required* procedures that are external to the module. The way the provided procedures transform the program state upon a call depends on how the required procedures transform the state. We take this observation as the basis of our abstract setting, in which state transformers are modelled as denotations (i.e., as binary relations over states). A component will thus be simply a mapping from denotations of the required procedures to denotations of the provided ones, both captured through the notion of procedure environments.

The contract theory is abstract, in that it is not defined for a particular programming language, and may be instantiated with any procedural language. As with the meta-theory, the abstract contract theory is also defined only on the semantic level.

Recall the notions and notation from Section 3.1. A component *interface* $I = (P^-, P^+)$ is a pair of disjoint, finite sets of procedure names, of the required and the provided ones, respectively.

Definition 1 (Component). *A* component $m$ *with interface* $I_m = (P^-_m, P^+_m)$ *is a mapping* $m : \mathbf{Env}_{P^-_m} \to \mathbf{Env}_{P^+_m}$*.*

Let $\mathbb{M}$ denote the universe of all components over $\mathcal{P}$.

We assume that any system is built up from a set of *base components*, the simplest components from which more complex components are then obtained by composition. The base components must be *monotonic* functions over the lattice defined in Section 3.1.

When $P^-_m = \emptyset$, we shall identify $m$ with an element of $\mathbf{Env}_{P^+_m}$. In other words, when a component is *closed*, i.e., is not dependent on any external procedures, the provided environment is constant.

Definition 2 (Component composability). *Two components* $m_1$ *and* $m_2$ *are* composable *iff* $P^+_{m_1} \cap P^+_{m_2} = \emptyset$*.*

When defining the composition of two components, particular care is required in the treatment of procedure names that are provided by one of the components while required by the other. Let μx. f(x) denote the least fixed-point of a function f, when it exists.

Definition 3 (Component composition). *Given two composable components* $m_1 : \mathbf{Env}_{P^-_{m_1}} \to \mathbf{Env}_{P^+_{m_1}}$ *and* $m_2 : \mathbf{Env}_{P^-_{m_2}} \to \mathbf{Env}_{P^+_{m_2}}$*, their* composition *is defined as a mapping* $m_1 \times m_2 : \mathbf{Env}_{P^-_{m_1 \times m_2}} \to \mathbf{Env}_{P^+_{m_1 \times m_2}}$ *such that:*

$$\begin{aligned} &P^+_{m_1 \times m_2} \stackrel{\text{def}}{=} P^+_{m_1} \cup P^+_{m_2} \\ &P^-_{m_1 \times m_2} \stackrel{\text{def}}{=} (P^-_{m_1} \cup P^-_{m_2}) \setminus (P^+_{m_1} \cup P^+_{m_2}) \\ &m_1 \times m_2 \stackrel{\text{def}}{=} \lambda \rho^-_{m_1 \times m_2} \in \mathbf{Env}_{P^-_{m_1 \times m_2}} .\ \mu \rho .\ \chi^+_{m_1 \times m_2}(\rho) \end{aligned}$$

*where* $\chi^+_{m_1 \times m_2} : \mathbf{Env}_{P^+_{m_1 \times m_2}} \to \mathbf{Env}_{P^+_{m_1 \times m_2}}$ *is defined, in the context of a given* $\rho^-_{m_1 \times m_2} \in \mathbf{Env}_{P^-_{m_1 \times m_2}}$*, as follows. Let* $\rho^+_{m_1 \times m_2} \in \mathbf{Env}_{P^+_{m_1 \times m_2}}$*, and let* $\rho^-_{m_1} \in \mathbf{Env}_{P^-_{m_1}}$ *be the environment defined by:*

$$
\rho^-_{m_1}(p) \stackrel{\text{def}}{=} \begin{cases}
\rho^+_{m_1 \times m_2}(p) & \text{if } p \in P^-_{m_1} \cap P^+_{m_2} \\
\rho^-_{m_1 \times m_2}(p) & \text{if } p \in P^-_{m_1} \setminus P^+_{m_2}
\end{cases}
$$

*and let* $\rho^-_{m_2} \in \mathbf{Env}_{P^-_{m_2}}$ *be defined symmetrically. We then define:*

$$\chi^+_{m_1 \times m_2}(\rho^+_{m_1 \times m_2})(p) \stackrel{\text{def}}{=} \begin{cases} m_1(\rho^-_{m_1})(p) & \text{if } p \in P^+_{m_1} \\ m_2(\rho^-_{m_2})(p) & \text{if } p \in P^+_{m_2} \end{cases}$$

In the above definition, $\chi^+_{m_1 \times m_2}$ represents the denotations of the procedure *bodies* of the procedures provided by the two composed components, given denotations of procedure *calls* to the same procedures. The choice of least fixed-point will be crucial for the proof of Theorem 2(i) in Section 4.2 below.

Composition is well-defined, in the sense that the stated least fixed-points exist, and the resulting components are monotonic functions.

Theorem 1. *Component composition is well-defined.*

The existence of a least fixed-point follows from the Knaster-Tarski Fixed-Point Theorem, as stated, e.g., in [23]. It can then be shown, by structural induction, that composition is well-defined. For lack of space, the proofs of all theorems, some of which are conceptually not very involved but rather verbose, are omitted here. The full proofs can be found in the accompanying technical report [17].
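Component composition can be sketched operationally as follows. This is an illustration rather than the construction used in the proofs: it assumes that relations are finite `frozenset` values and that components are monotonic, so that iterating from the bottom environment terminates at the least fixed-point. The class `Component`, the function `compose`, and the two example components are hypothetical.

```python
class Component:
    """A component: a function from environments over `required` to environments
    over `provided`, where an environment is a dict name -> frozenset of pairs."""
    def __init__(self, required, provided, fn):
        self.required = frozenset(required)
        self.provided = frozenset(provided)
        self.fn = fn

    def __call__(self, rho_minus):
        return self.fn(rho_minus)

def compose(m1, m2):
    assert not (m1.provided & m2.provided), "components must be composable"
    provided = m1.provided | m2.provided
    required = (m1.required | m2.required) - provided

    def fn(rho_minus):
        def chi(rho_plus):
            # A required procedure is read from rho_plus if the other component
            # provides it, and from rho_minus otherwise.
            merged = {**rho_minus, **rho_plus}
            out = {}
            for m in (m1, m2):
                out.update(m({p: merged[p] for p in m.required}))
            return out
        rho = {p: frozenset() for p in provided}   # bottom environment
        while (nxt := chi(rho)) != rho:            # iterate to the least fixed point
            rho = nxt
        return rho

    return Component(required, provided, fn)

# Hypothetical components: p behaves like q plus one extra pair; q forwards to ext.
m1 = Component({"q"}, {"p"}, lambda rho: {"p": rho["q"] | frozenset({("a", "b")})})
m2 = Component({"ext"}, {"q"}, lambda rho: {"q": rho["ext"]})
m = compose(m1, m2)
result = m({"ext": frozenset({("x", "y")})})
```

The composed component requires only `ext`, since `q` is resolved internally, and `result` assigns `q` the relation passed in for `ext` and `p` that relation plus the extra pair.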

#### 4.2 Denotational Contracts

We now define the notion of denotational contracts c in the style of *assume/guarantee contracts* [4,6]. Contracts shall also be given interfaces.

Definition 4 (Denotational contract). *A* denotational contract $c$ *with interface* $I_c = (P^-_c, P^+_c)$ *is a pair* $(\rho^-_c, \rho^+_c)$*, where* $\rho^-_c \in \mathbf{Env}_{P^-_c}$ *and* $\rho^+_c \in \mathbf{Env}_{P^+_c}$*.*

The intended interpretation of the environment pair is as follows: *assuming* that the denotation of every called procedure $p \in P^-_c$ is subsumed by $\rho^-_c(p)$, then it is *guaranteed* that the denotation of every provided procedure $p' \in P^+_c$ is subsumed by $\rho^+_c(p')$.

Definition 5 (Contract implementation). *A component* $m$ *with interface* $I_m = (P^-_m, P^+_m)$ *is an* implementation *for, or* implements*, a contract* $c = (\rho^-_c, \rho^+_c)$ *with interface* $I_c = (P^-_c, P^+_c)$*, denoted* $m \models c$*, iff* $P^-_c \subseteq P^-_m$*,* $P^+_m \subseteq P^+_c$*, and* $m(\rho^-_c \sqcup \rho^{\top}_{P^-_m \setminus P^-_c}) \sqsubseteq \rho^+_c$*.*

The reason for not requiring the interfaces to be equal is that we aim at a subset relation between components implementing a contract and those implementing a refinement of said contract, in the meta-theory instantiation.
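The implementation check of Definition 5 can be sketched as below. The example is deliberately tiny and entirely hypothetical: a two-element state set and a component whose provided procedure `p` simply behaves like its required procedure `q`. The function names `leq` and `implements` are illustrative.

```python
def leq(rho1, rho2):
    """Environment order: smaller domain and pointwise inclusion."""
    return set(rho1) <= set(rho2) and all(rho1[p] <= rho2[p] for p in rho1)

def implements(m_required, m_provided, m_fn, c_minus, c_plus, top):
    """Definition 5, with `top` the full relation State x State."""
    if not (set(c_minus) <= set(m_required) and set(m_provided) <= set(c_plus)):
        return False
    # Required procedures that the contract does not mention are assumed arbitrary.
    assumed = {p: c_minus.get(p, top) for p in m_required}
    return leq(m_fn(assumed), c_plus)

STATES = [0, 1]
TOP = frozenset((s, t) for s in STATES for t in STATES)
IDENT = frozenset((s, s) for s in STATES)

def forward(rho):
    """Hypothetical component: p just calls q."""
    return {"p": rho["q"]}

ok = implements({"q"}, {"p"}, forward, {"q": IDENT}, {"p": IDENT}, TOP)
no_assumption = implements({"q"}, {"p"}, forward, {}, {"p": IDENT}, TOP)
```

With the assumption that `q` is the identity, the component implements the contract; without any assumption on `q`, the top relation is substituted and the guarantee fails.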

For a mapping $h : A \to B$ and set $A' \subseteq A$, let $h|_{A'}$ denote as usual the restriction of $h$ on $A'$.

Definition 6 (Contract environment). *A component* $m$ *is an* environment *for contract* $c$ *iff, for any implementation* $m'$ *of* $c$*,* $m$ *and* $m'$ *are composable, and* $\forall \rho^-_{m \times m'} \in \mathbf{Env}_{P^-_{m \times m'}}.\ (m \times m')(\rho^-_{m \times m'})|_{P^+_c} \sqsubseteq \rho^+_c$*.*

Intuitively, an environment of a contract c is then a component such that when it is composed with an implementation of c, the composition will operate satisfactorily with respect to the guarantee of the contract.

We will now define the refinement relation, and the conjunction and composition operations, on contracts.

Definition 7 (Contract refinement). *A contract* $c$ refines *contract* $c'$*, denoted* $c \preceq c'$*, iff* $\rho^-_{c'} \sqsubseteq \rho^-_c$ *and* $\rho^+_c \sqsubseteq \rho^+_{c'}$*, where* $\sqsubseteq$ *is the partial order relation defined in Section 3.1.*

The refinement relation reflects the intention that if a contract $c$ refines another contract $c'$, then any component implementing $c$ should also implement $c'$.

Definition 8 (Contract conjunction). *The* conjunction *of two contracts* $c_1 = (\rho^-_{c_1}, \rho^+_{c_1})$ *and* $c_2 = (\rho^-_{c_2}, \rho^+_{c_2})$ *is the contract* $c_1 \wedge c_2 \stackrel{\text{def}}{=} (\rho^-_{c_1} \sqcup \rho^-_{c_2}, \rho^+_{c_1} \sqcap \rho^+_{c_2})$*, where* $\sqcup$ *and* $\sqcap$ *are the* lub *and* glb *operations of the lattice, respectively.*

This definition is consistent with the intention that any contract that refines $c_1 \wedge c_2$ should also refine $c_1$ and $c_2$ individually. The interface of $c_1 \wedge c_2$ is then $I_{c_1 \wedge c_2} = (P^-_{c_1} \cup P^-_{c_2}, P^+_{c_1} \cap P^+_{c_2})$. Note that while this is the interface in general, conjunction of contracts is typically used to merge different viewpoints of *the same* component, and in that case $I_{c_1} = I_{c_2} = I_{c_1 \wedge c_2}$.

Definition 9 (Contract composability). *Two contracts* $c_1 = (\rho^-_{c_1}, \rho^+_{c_1})$ *and* $c_2 = (\rho^-_{c_2}, \rho^+_{c_2})$ *with interfaces* $I_{c_1} = (P^-_{c_1}, P^+_{c_1})$ *and* $I_{c_2} = (P^-_{c_2}, P^+_{c_2})$ *are* composable *if: (i)* $P^+_{c_1} \cap P^+_{c_2} = \emptyset$*, (ii)* $\forall p \in P^-_{c_1} \cap P^+_{c_2}.\ \rho^+_{c_2}(p) \subseteq \rho^-_{c_1}(p)$*, and (iii)* $\forall p \in P^-_{c_2} \cap P^+_{c_1}.\ \rho^+_{c_1}(p) \subseteq \rho^-_{c_2}(p)$*.*

The conditions for composability ensure that the mutual guarantees of the two contracts meet each other's assumptions.
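The three composability conditions translate directly into a check on pairs of environments. The sketch below assumes that a contract is a pair `(rho_minus, rho_plus)` of dicts and uses placeholder relations over opaque state names; the contracts `c_even`, `c_odd` and `too_strict` are hypothetical examples.

```python
def composable(c1, c2):
    """Contract composability for contracts given as (rho_minus, rho_plus) pairs."""
    (minus1, plus1), (minus2, plus2) = c1, c2
    if set(plus1) & set(plus2):
        return False   # (i) the provided sets must be disjoint
    # (ii) guarantees of c2 must satisfy the assumptions of c1
    ok_12 = all(plus2[p] <= minus1[p] for p in set(minus1) & set(plus2))
    # (iii) guarantees of c1 must satisfy the assumptions of c2
    ok_21 = all(plus1[p] <= minus2[p] for p in set(minus2) & set(plus1))
    return ok_12 and ok_21

# Placeholder guarantees over opaque states.
G_EVEN = frozenset({("s0", "s1")})
G_ODD = frozenset({("s1", "s0")})

c_even = ({"odd": G_ODD}, {"even": G_EVEN})
c_odd = ({"even": G_EVEN}, {"odd": G_ODD})
too_strict = ({"odd": frozenset()}, {"even": G_EVEN})   # assumes odd never terminates
```

Here `c_even` and `c_odd` are composable because each assumes exactly what the other guarantees, while `too_strict` is not composable with `c_odd` because its assumption about `odd` is stronger than what `c_odd` guarantees.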

Definition 10 (Contract composition). *The* composition *of two composable contracts* $c_1 = (\rho^-_{c_1}, \rho^+_{c_1})$ *and* $c_2 = (\rho^-_{c_2}, \rho^+_{c_2})$*, with interfaces* $I_{c_1} = (P^-_{c_1}, P^+_{c_1})$ *and* $I_{c_2} = (P^-_{c_2}, P^+_{c_2})$*, respectively, is the contract* $c_1 \otimes c_2 \stackrel{\text{def}}{=} (\rho^-_{c_1 \otimes c_2}, \rho^+_{c_1} \sqcup \rho^+_{c_2})$*, where:*

$$\rho^-_{c_1 \otimes c_2} \stackrel{\text{def}}{=} (\rho^-_{c_1} \sqcap \rho^-_{c_2})\big|_{(P^-_{c_1} \cup P^-_{c_2}) \setminus (P^+_{c_1} \cup P^+_{c_2})}$$

The interface of $c_1 \otimes c_2$ is $I_{c_1 \otimes c_2} = ((P^-_{c_1} \cup P^-_{c_2}) \setminus (P^+_{c_1} \cup P^+_{c_2}),\ P^+_{c_1} \cup P^+_{c_2})$.

Theorem 2. *For any composable contracts* $c_1$ *and* $c_2$*, and any implementations* $m_1 \models c_1$ *and* $m_2 \models c_2$*,* $m_1$ *and* $m_2$ *are composable, and* $c_1 \otimes c_2$ *is the least contract (w.r.t. refinement order) for which the following properties hold:*

*(i)* $m_1 \times m_2 \models c_1 \otimes c_2$*,*

*(ii) if* $m$ *is an environment to* $c_1 \otimes c_2$*, then* $m_1 \times m$ *is an environment to* $c_2$*,*

*(iii) if* $m$ *is an environment to* $c_1 \otimes c_2$*, then* $m \times m_2$ *is an environment to* $c_1$*.*

### 5 Connection to Meta-Theory

In this section we show that the abstract contract theory presented in Section 4 instantiates the meta-theory described in Section 2.2.

In our instantiation of the meta-theory, we take the abstract component universe of the meta-theory to be the universe of components $\mathbb{M}$ defined in Section 4.1. To distinguish the contracts of the meta-theory from those of the abstract theory, we shall always denote the former by $C$ and the latter by $c$. Recall that a contract $C$ is a pair $(E, M)$, where $E, M \subseteq \mathbb{M}$. The formal connection between the two notions is established with the following definition.

Definition 11 (Induced contract). *Let* $c$ *be a denotational contract. It* induces *the contract* $C_c = (E_c, M_c)$*, where* $E_c \stackrel{\text{def}}{=} \{m \in \mathbb{M} \mid m$ *is an environment for* $c\}$ *and* $M_c \stackrel{\text{def}}{=} \{m \in \mathbb{M} \mid m \models c\}$*.*

Since contract implementation requires that the implementing component's provided procedures are a subset of the contract's provided procedures, every component $m$ such that $P^+_m \cap P^+_c = \emptyset$ is composable with every component in $M_c$.

The definitions of implementation, refinement and conjunction of denotational contracts make this straightforward definition of induced contracts possible, so that it directly results in refinement as set membership and conjunction as lub w.r.t. the refinement order.

Theorem 3. *The contract theory of Section 4 instantiates the meta-theory of Benveniste et al. [5], in the sense that composition of components is associative and commutative, and for any two contracts* $c_1$ *and* $c_2$*:*

*(i)* $c_1 \preceq c_2$ *iff* $C_{c_1}$ *refines* $C_{c_2}$ *according to the definition of the meta-theory,*

*(ii)* $C_{c_1 \wedge c_2}$ *is the conjunction of* $C_{c_1}$ *and* $C_{c_2}$ *as defined in the meta-theory, and*

*(iii)* $C_{c_1 \otimes c_2}$ *is the composition of* $C_{c_1}$ *and* $C_{c_2}$ *as defined in the meta-theory.*

The proof is straightforward, since many definitions of the contract theory are deliberately similar to their counterparts in the meta-theory.

Let us now return to our example from Section 3. When applying Contract Based Design, contracts at the more abstract level will be decomposed into contracts at the more concrete level. So, for our example, we might have at the top level a contract $c = (\rho^-_c, \rho^+_c)$ with interface $(\emptyset, \{\mathit{even}, \mathit{odd}\})$, where $\rho^-_c = \emptyset$, and where $\rho^+_c \in \mathbf{Env}_{P^+_c}$ maps *even* to the set of pairs $(s, s')$ such that whenever $s(n)$ is non-negative and even, then $s'(r) = 1$, and when $s(n)$ is non-negative and odd, then $s'(r) = 0$, and maps *odd* in a dual manner. This contract could then be decomposed into two contracts $c_{\mathit{even}}$ and $c_{\mathit{odd}}$, so that $\rho^+_{c_{\mathit{even}}}(\mathit{even}) \stackrel{\text{def}}{=} \rho^+_c(\mathit{even})$ and $\rho^-_{c_{\mathit{even}}}(\mathit{odd}) \stackrel{\text{def}}{=} \rho^+_c(\mathit{odd})$, and $c_{\mathit{odd}}$ is analogous. Then, we would have $c_{\mathit{even}} \otimes c_{\mathit{odd}} \preceq c$, and for any two components $m_{\mathit{even}}$ and $m_{\mathit{odd}}$ such that $m_{\mathit{even}} \models c_{\mathit{even}}$ and $m_{\mathit{odd}} \models c_{\mathit{odd}}$, it would hold that $m_{\mathit{even}} \times m_{\mathit{odd}} \models c$.

#### 6 Connection to Programs with Procedures

In this section we discuss how our abstract contract theory from Section 4 relates to programs with procedures as presented in Section 3.1, and how it relates to Hoare logic and procedure-modular verification as presented in Section 3.2.

First, we define how to abstract the denotational notion of procedures into components in the abstract theory, based on the function ξ from Section 3.1.

Definition 12 (From procedure sets to components). *For any set of procedures* $P^+$*, calling procedures* $P'$*, we define the component* $m : \mathbf{Env}_{P^-_m} \to \mathbf{Env}_{P^+_m}$*, where* $P^-_m \stackrel{\text{def}}{=} P' \setminus P^+_m$ *and* $P^+_m \stackrel{\text{def}}{=} P^+$*, so that* $\forall \rho^-_m \in \mathbf{Env}_{P^-_m}.\ \forall p \in P^+_m.\ m(\rho^-_m)(p) \stackrel{\text{def}}{=} [\![S_p]\!]_{\rho^-_m}$*.*

As the next result shows, procedure set abstraction and component composition commute. Together with commutativity and associativity of component composition, this means that the initial grouping of procedures into components is irrelevant, and that one can start with abstracting each individual procedure into a component.

Theorem 4. *For any two disjoint sets of procedures* $P^+_1$ *and* $P^+_2$*, abstracted individually into components* $m_1$ *and* $m_2$*, respectively, and* $P^+_1 \cup P^+_2$ *abstracted into component* $m$*, it holds that* $m_1 \times m_2 = m$*.*

The result is a direct consequence of Definition 12, Definition 3, and the well-known Bekić's Lemma [3] about simultaneous fixed-points.

*Component abstraction example.* Let us illustrate the theorem on our even-odd example (however, the example does not really illustrate Bekić's Lemma, since the two procedures do not call themselves).

By Definition 12, the procedure set {*even*} is abstracted into component m*even* : **Env**{odd} → **Env**{even} with interface ({odd}, {even}), so that ∀ρ<sup>−</sup> ∈ **Env**{odd}. m(ρ−)(*even*) = [[S*even*]]<sup>ρ</sup><sup>−</sup> . By definition, [[S*even*]]<sup>ρ</sup><sup>−</sup> is equal to [[S*even*]]<sup>ρ</sup><sup>+</sup> 0 <sup>ρ</sup><sup>−</sup> , where <sup>ρ</sup><sup>+</sup> <sup>0</sup> is the least fixed point of ξ : **Env**{*even*} → **Env**{*even*} defined by ξ(ρ<sup>+</sup>)(*even*) def = [[S*even*]]<sup>ρ</sup><sup>+</sup> <sup>ρ</sup><sup>−</sup> for any <sup>ρ</sup><sup>+</sup> <sup>∈</sup> **Env**{even}. Notice, however, that procedure *even* does not have any calls to itself, so [[S*even*]]<sup>ρ</sup><sup>+</sup> 0 <sup>ρ</sup><sup>−</sup> does not really depend on <sup>ρ</sup><sup>+</sup>. Then, for any <sup>ρ</sup><sup>−</sup> <sup>∈</sup> **Env**{odd}, (s, s ) ∈ m(ρ−)(*even*) if either s(n)=0 and s = s[r '→ 1], or else if s(n) > 0 and (s[n '→ s(n)−1], s ) ∈ ρ−(*odd*).

Similarly, the procedure set {*odd*} is abstracted into component m*odd* : **Env**{even} → **Env**{odd} with interface ({even}, {odd}), so that ∀ρ<sup>−</sup> ∈ **Env**{even}. m(ρ−)(*odd*) = [[S*odd* ]]<sup>ρ</sup><sup>−</sup> . Then, for any ρ<sup>−</sup> ∈ **Env**{even}, (s, s ) ∈ m(ρ−)(*odd*) if either s(n)=0 and s = s[r '→ 0], or else if s(n) > 0 and (s[n '→ s(n) − 1], s ) ∈ ρ−(*even*).

Now, applying Definition 12 to the whole (closed) program yields a component <sup>m</sup> : **Env**<sup>∅</sup> <sup>→</sup> **Env**{*even*,*odd*} with interface (∅, {*even*, *odd*}), so that ∀ρ<sup>−</sup> ∈ **Env**∅. ∀p ∈ {*even*, *odd*} . m(ρ−)(p) = [[Sp]]<sup>ρ</sup><sup>−</sup> . Recall the denotations [[S*even*]]<sup>ρ</sup><sup>−</sup> and [[S*odd* ]]<sup>ρ</sup><sup>−</sup> from the end of Section 3.1.

Components m*even* and m*odd* are composable, and by Definition 3, their composition has (the same) interface (∅, {*even*, *odd*}), and is (also) a mapping m*even* × m*odd* : **Env**<sup>∅</sup> → **Env**{*even*,*odd*}.

Finally, note that function χ<sup>+</sup> <sup>m</sup>*even*×m*odd* : **Env**{*even*,*odd*} <sup>→</sup> **Env**{*even*,*odd*} is exactly the function <sup>ξ</sup> in the context of the interface (∅, {*even*, *odd*}). This can be seen by first noting that since **Env**<sup>∅</sup> = ∅, we have that χ<sup>+</sup> m*even*×m*odd* only depends on its arguments. Furthermore, for all <sup>ρ</sup><sup>+</sup> <sup>∈</sup> **Env**{*even*,*odd*}, if ρ+ *odd* def = ρ<sup>+</sup> {odd} and ρ<sup>+</sup> *even* def = ρ<sup>+</sup> {even} we have that, since *odd* <sup>∈</sup> <sup>P</sup> <sup>−</sup> *even* <sup>∩</sup> <sup>P</sup> <sup>+</sup> *odd* , then χ<sup>+</sup> <sup>m</sup>*even*×m*odd* (ρ<sup>+</sup>)(*even*) = <sup>m</sup>*even*(ρ<sup>+</sup> *odd* )(*even*) = [[S*even*]]ρ<sup>+</sup> *odd* = [[S*even*]]<sup>ρ</sup><sup>+</sup> = ξ(ρ<sup>+</sup>)(*even*). Similarly χ<sup>+</sup> <sup>m</sup>*even*×m*odd* (ρ<sup>+</sup>)(*odd*) = <sup>ξ</sup>(ρ<sup>+</sup>)(*odd*). We therefore have m*even* × m*odd* = m.
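The composition in this example can be mimicked in Python. This is only a sketch under simplifying assumptions: denotations are modelled as Python functions on states rather than as relations, a state is a dictionary with keys `n` and `r`, and the simultaneous fixed point is approximated by letting both components share one mutable environment. The names `m_even`, `m_odd` and `compose` are introduced here for illustration only.

```python
from typing import Callable, Dict

State = Dict[str, int]
Proc = Callable[[State], State]
Env = Dict[str, Proc]


def m_even(rho_minus: Env) -> Env:
    """Component for {even}: needs an environment that provides 'odd'."""
    def even(s: State) -> State:
        if s["n"] == 0:
            return {**s, "r": 1}                      # s' = s[r -> 1]
        return rho_minus["odd"]({**s, "n": s["n"] - 1})
    return {"even": even}


def m_odd(rho_minus: Env) -> Env:
    """Component for {odd}: needs an environment that provides 'even'."""
    def odd(s: State) -> State:
        if s["n"] == 0:
            return {**s, "r": 0}                      # s' = s[r -> 0]
        return rho_minus["even"]({**s, "n": s["n"] - 1})
    return {"odd": odd}


def compose() -> Env:
    """Compose both components over the empty required interface.

    The shared dictionary is looked up only when a procedure is called,
    so each component sees the other's provided procedure.
    """
    env: Env = {}
    env.update(m_even(env))
    env.update(m_odd(env))
    return env


env = compose()
print(env["even"]({"n": 4, "r": 0}))
```

In this sketch the composed environment provides both `even` and `odd` and requires nothing, which mirrors the interface (∅, {*even*, *odd*}) of the composition.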

We now define how to abstract Hoare logic contracts into denotational contracts, in terms of the contract environment ρ<sup>c</sup> defined in Section 3.2.

Definition 13 (From Hoare logic contracts to denotational contracts). *For a procedure* $p$ *with Hoare logic contract* $C_p$*, calling other procedures* $P^-$*, we define the denotational contract* $c_p = (\rho^-_{c_p}, \rho^+_{c_p})$ *with interface* $P^+_{c_p} \stackrel{\mathrm{def}}{=} \{p\}$ *and* $P^-_{c_p} \stackrel{\mathrm{def}}{=} P^-$*, so that* $\rho^+_{c_p}(p) \stackrel{\mathrm{def}}{=} \rho_c(p)$*, and* $\forall p' \in P^-.\ \rho^-_{c_p}(p') = \rho_c(p')$*.*

In this way, conceptually, denotational contracts become assume/guarantee-style specifications over Hoare logic procedure contracts: assuming that all (external) procedures called by a procedure p transform the state according to their Hoare logic contracts, procedure p obliges itself to do so as well.

We now show that if a procedure implements a Hoare logic contract, then the abstracted component will implement the abstracted contract, and vice versa. Together with Theorem 4, this result allows the *procedure-modular verification* of abstract components.

Theorem 5. *For any procedure* $p$ *with procedure contract* $C_p$*, abstracted into component* $m_p$ *with contract* $c_p$*, we have* $S_p \models^{cr}_{par} C_p$ *iff* $m_p \models c_p$*.*

The result follows mainly from Definitions 12 and 13, and the denotational semantics given in Section 3.

Returning to the example from Sections 3 and 5, we can abstract the procedure set {even} into component m*even*, with interface ({odd}, {even}), which would be a function **Env**{odd} → **Env**{even}, and ∀ρ<sup>−</sup> ∈ **Env**{odd}. m(ρ<sup>−</sup>)(*even*) = [[S*even*]]<sup>ρ</sup><sup>−</sup> . The denotational contracts c*even* and c*odd* resulting from the decomposition shown in Section 5 would be exactly the abstraction of the Hoare logic contracts C*even* and C*odd* shown in Section 3.2. They would both be part of the contract environment used in procedure-modular verification, for example when verifying that $S_{even} \models^{cr}_{par} C_{even}$, which would entail m*even* |= c*even*. Thus, by applying standard procedure-modular verification at the source code level, we prove the top-level contract c proposed in Section 5.

#### 7 Conclusion

We presented an abstract contract theory for procedural languages, based on denotational semantics. The theory is shown to be an instance of the meta-theory of [5], and at the same time an abstraction of the standard denotational semantics of procedural languages. We believe that our contract theory can be used to support the development of cyber-physical and embedded systems by the design methodology supported by the meta-theory, allowing the individual procedures of the embedded software to be treated as any other system component. The work also strengthens the claims of the meta-theory of distilling the notion of contracts to its essence, by showing that it is applicable also in the context of procedural programs and deductive verification. Finally, this work serves as a preparation for combining our contract theory for procedural programs with other instantiations of the meta-theory. In future work we plan to investigate the utility of our contract theory on real embedded systems taken from the automotive industry, where not all components are procedural programs, or even software (cf. our previous work, e.g., [11]). We also plan to extend our toy imperative language with additional features, such as procedure parameters and return values. Furthermore, we plan to extend the contract theory to capture program traces by developing a finite-trace semantics, to enable its use in the specification and verification of temporal properties. Lastly, we plan to combine our contract theory with an existing contract theory for hybrid systems [20].

# References



# **Paracosm: A Test Framework for Autonomous Driving Simulations**

Rupak Majumdar<sup>1</sup>, Aman Mathur<sup>1</sup>, Marcus Pirron<sup>1</sup>, Laura Stegner<sup>2</sup>, and Damien Zufferey<sup>1</sup>

<sup>1</sup> MPI-SWS, Kaiserslautern, Germany {rupak, mathur, mpirron, zufferey}@mpi-sws.org

<sup>2</sup> University of Wisconsin, Madison, USA stegner@wisc.edu

**Abstract.** Systematic testing of autonomous vehicles operating in complex real-world scenarios is a difficult and expensive problem. We present Paracosm, a framework for writing systematic test scenarios for autonomous driving simulations. Paracosm allows users to programmatically describe complex driving situations with specific features, e.g., road layouts and environmental conditions, as well as reactive temporal behaviors of other cars and pedestrians. A systematic exploration of the state space, both for visual features and for reactive interactions with the environment is made possible. We define a notion of test coverage for parameter configurations based on combinatorial testing and low dispersion sequences. Using fuzzing on parameter configurations, our automatic test generator can maximize coverage of various behaviors and find problematic cases. Through empirical evaluations, we demonstrate the capabilities of Paracosm in programmatically modeling parameterized test environments, and in finding problematic scenarios.

**Keywords:** Autonomous driving · Testing · Reactive programming.

### **1 Introduction**

Building autonomous driving systems requires complex and intricate engineering effort. At the same time, ensuring their reliability and safety is an extremely difficult task. There are serious public safety and trust concerns [63], aggravated by recent accidents involving autonomous cars [48]. Software in such vehicles combines well-defined tasks such as trajectory planning, steering, acceleration and braking, with underspecified tasks such as building a semantic model of the environment from raw sensor data and making decisions using this model. Unfortunately, these underspecified tasks are critical to the safe operation of autonomous vehicles. Therefore, testing in large varieties of realistic scenarios is the only way to build confidence in the correctness of the overall system.

Running real tests is a necessary, but slow and costly process. It is difficult to reproduce corner cases due to infrastructure and safety issues; one can neither run over pedestrians to demonstrate a failing test case, nor wait for specific weather and road conditions. Therefore, the automotive industry tests autonomous systems in virtual simulation environments [21, 26, 53, 61, 68, 72]. Simulation reduces the cost per test, and more importantly, gives precise control over all aspects of the environment, so as to test corner cases.

© The Author(s) 2021. E. Guerra and M. Stoelinga (Eds.): FASE 2021, LNCS 12649, pp. 172–195, 2021. https://doi.org/10.1007/978-3-030-71500-7_9

Fig. 1: A Paracosm program consists of parameterized reactive components such as the test vehicle, the environment, road networks, other actors and their behaviors, and monitors. The test input generation scheme guarantees good coverage over the parameter space. The test scenario depicted here shows a test vehicle stopping for a jaywalking pedestrian.

A major limitation of current tools is the lack of customizability: they either provide a GUI-based interface to design an environment piece-by-piece, or focus on bespoke pre-made environments. This makes the setup of varied scenarios difficult and time consuming. Though exploiting parametricity in simulation is useful and effective [10,23,31,67], the cost of environment setup, and navigating large parameter spaces, is quite high [31]. Prior works have used bespoke environments with limited parametricity. More recently, programmatic interfaces have been proposed [27] to make such test procedures more systematic. However, the simulated environments are largely still fixed, with no dynamic behavior.

In this work, we present Paracosm, a programmatic interface that enables the design of *parameterized environments* and *test cases*. Test parameters control the environment and the behaviors of the actors involved. Paracosm supports various test input generation strategies, and we provide a notion of coverage for these. Rather than computing coverage over intrinsic properties of the system under test (which is not yet understood for neural networks [39]), our coverage criterion is over the space of test parameters. Figure 1 depicts the various parts of a Paracosm test. A Paracosm program represents a family of tests, where each instantiation of the program's parameters is a concrete test case.

Paracosm is based on a synchronous reactive programming model [13, 35, 40, 70]. Components, such as road segments or cars, receive streams of inputs and produce streams of outputs over time. In addition, components have graphical assets to describe their appearance for an underlying visual rendering engine and physical properties for an underlying physics simulator. For example, a vehicle in Paracosm not only has code that reads in sensor feeds and outputs steering angle or braking, but also has a textured mesh representing its shape, position and orientation in 3D space, and a physics model for its dynamical behavior. A Paracosm configuration consists of a composition of several components. Using a set of system-defined components (road segments, cars, pedestrians, etc.) combined using expressive operations from the underlying reactive programming model, users can set up complex temporally varying driving scenarios. For example, one can build an urban road network with intersections, pedestrians and vehicular traffic, and parameterize both environment conditions (lighting, fog) and behaviors (when a pedestrian crosses a street).

Streams in the world description can be left "open" and, during testing, Paracosm automatically generates sequences of values for these streams. We use a coverage strategy based on k*-wise combinatorial coverage* [14, 38] for discrete variables and *dispersion* for continuous variables. Intuitively, k-wise coverage ensures that, for a programmer-specified parameter k, all possible combinations of values of any k discrete parameters are covered by tests. Low dispersion [57] ensures that there are no "large empty holes" left in the continuous parameter space. Paracosm uses an automatic test generation strategy that offers high coverage based on random sampling over discrete parameters and *deterministic* quasi-Monte Carlo methods for continuous parameters [49, 57].

Like many of the projects referenced before, our implementation performs simulations inside a game engine. However, Paracosm configurations can also be output to the OpenDRIVE format [7] for use with other simulators, which is more in line with the current industry standard. We demonstrate through various case studies how Paracosm can be an effective testing framework for both qualitative properties (crash) and quantitative properties (distance maintained while following a car, or image misclassification).

Our main contributions are the following: (I) We present a programmable and expressive framework for programmatically modeling complex and parameterized scenarios to test autonomous driving systems. Using Paracosm one can specify the environment's layout, behaviors of actors, and expose parameters to a systematic testing infrastructure. (II) We define a notion of test coverage based on combinatorial k-wise coverage in discrete space and low dispersion in continuous space. We show a test generation strategy based on fuzzing that theoretically guarantees good coverage. (III) We demonstrate empirically that our system is able to express complex scenarios and automatically test autonomous driving agents and find incorrect behaviors or degraded performance.

### **2 Paracosm through Examples**

We now provide a walkthrough of Paracosm through a testing example. Suppose we have an autonomous vehicle to test. Its implementation is wrapped into a parameterized class:

```
AutonomousVehicle (start , model , controller) {
    void run(...) { ... } }
```
where the model ranges over possible car models (appearance, physics), and the controller implements an autonomous controller. The goal is to test this class in many different driving scenarios, including different road networks, weather and light conditions, and other car and pedestrian traffic. We show how Paracosm enables writing such tests as well as generating test inputs automatically.

A *test configuration* consists of a composition of *reactive objects*. The following is an outline of a test configuration in Paracosm, in which the autonomous vehicle drives on a road with a pedestrian wanting to cross. We have simplified the API syntax for the sake of clarity and omit the enclosing Test class. In the code segments, we use ':' for named arguments.

```
// Test parameters
light = VarInterval(0.2, 1.0)  // value in [0.2, 1.0]
nlanes = VarEnum({2, 4, 6})    // value is 2, 4 or 6
// Description of environment
w = World(light: light, fog: 0)
// Create a road segment
r = StraightRoadSegment(len: 100, nlanes: nlanes)
// The autonomous vehicle controlled by the SUT
v = AutonomousVehicle(start: ..., model: ..., controller: ...)
// Some other actor(s)
p = Pedestrian(start: ..., model: ..., ...)
// Monitor to check some property
c = CollisionMonitor(v)
// Place elements in the world
run_test(env: {w, r, v, p}, test_params: {light, nlanes},
         monitors: {c}, iterations: 100)
```
An instantiation of the reactive objects in the test configuration gives a *scene*: all the visual elements present in the simulated world. A *test case* provides concrete inputs to each "open" input stream in a scene. A test case determines how the scene evolves over time: how the cars and pedestrians move and how environment conditions change. We go through each part of the test configuration in detail below.

*Reactive Objects.* The core abstraction of Paracosm is a *reactive object*. Reactive objects capture geometric and graphical features of a physical object, as well as its behavior over time. The behavioral interface for each reactive object has a set of *input* streams and a set of *output* streams. The evolution of the world is computed in steps of fixed duration, which correspond to events in a predefined tick stream. For streams that correspond to physical quantities updated by the physics simulator, such as positions and speeds of cars, appropriate events are generated by the underlying physics simulator.

Input streams provide input values from the environment over time; output streams represent output values computed by the object. The object's constructor sets up the internal state of the object. An object is updated by event-triggered computations. Paracosm provides a set of assets as base classes. Autonomous driving systems naturally fit reactive programming models. They consume sensor input streams and produce actuator streams for the vehicle model. We differentiate between static *environment* reactive objects (subclassing Geometric) and dynamic *actor* reactive objects (subclassing Physical). Environment reactive objects represent "static" components of the world, such as road segments, intersections, buildings or trees, and a special component called the *world*. Actor reactive objects represent components with "dynamic" behavior: vehicles or pedestrians. The world object is used to model features of the world such as lighting or weather conditions.

Reactive objects can be *composed* to generate complex assemblies from simple objects. The composition process can be used to connect static components structurally, such as two road segments connecting at an intersection. Composition also connects the behavior of an object to another by binding output streams to input streams. At run time, the values on the input stream of the second object are obtained from the output values of the first. Composition must respect geometric properties: the runtime system ensures that a composition maintains invariants such as no intersection of geometric components. We now describe the main features in Paracosm, centered around the test configuration above.

Fig. 2: Reactive streams represented by a marble diagram. A change in the value of test parameters nlanes or light changes the environment, and triggers a change in the corresponding sensor (output) stream camera.

*Test Parameters.* Using test variables, we can have general, but constrained streams of values passed into objects [59]. Our automatic test generator can then pick values for these variables, thereby leading to different test cases (see Figure 2). There are two types of parameters: continuous (VarInterval) and discrete (VarEnum). In the example presented, light (light intensity) is a continuous test parameter and nlanes (number of lanes) is discrete.
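As a rough illustration of how the two kinds of test variables might be represented, consider the following Python sketch. The class names mirror VarInterval and VarEnum from the example, but the `sample` method, the `instantiate` helper and the use of uniform random sampling are assumptions made here for illustration; they are not Paracosm's API, and Section 3 describes the sampling strategies actually used.

```python
import random
from dataclasses import dataclass
from typing import Dict, Sequence, Union


@dataclass(frozen=True)
class VarInterval:
    """Continuous test parameter with values in [low, high]."""
    low: float
    high: float

    def sample(self, rng: random.Random) -> float:
        return rng.uniform(self.low, self.high)


@dataclass(frozen=True)
class VarEnum:
    """Discrete test parameter ranging over a finite set of values."""
    values: Sequence[int]

    def sample(self, rng: random.Random) -> int:
        return rng.choice(list(self.values))


TestVar = Union[VarInterval, VarEnum]


def instantiate(params: Dict[str, TestVar], seed: int = 0) -> Dict[str, float]:
    """Pick one concrete value per parameter, giving one concrete test case."""
    rng = random.Random(seed)
    return {name: var.sample(rng) for name, var in params.items()}


params = {"light": VarInterval(0.2, 1.0), "nlanes": VarEnum((2, 4, 6))}
test_case = instantiate(params)
print(test_case)
```

Each call to `instantiate` with a different seed yields a different concrete test case from the same parameterized description.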

*World.* The World is a pre-defined reactive object in Paracosm with a visual representation responsible for atmospheric conditions like the light intensity, direction and color, fog density, etc. The code segment

```
w = World (light:light , fog:0)
```
parameterizes the world using a test variable for light and sets the fog density to a constant (0).

*Road Segments.* In our example, StraightRoadSegment was parameterized with the number of lanes. In general, Paracosm provides the ability to build complex road networks by connecting primitives of individual road segments and intersections. (A detailed example is presented in our Technical Report [43].)

It may seem surprising that we model static scene components such as roads as reactive objects. This serves two purposes. First, we can treat the number of lanes in a road segment as a constant input stream that is set by the test case, allowing parameterized test cases. Second, certain features of static objects can also change over time. For example, the coefficient of friction on a road segment may depend on the weather condition, which can be a function of time.

*Autonomous Vehicles & System Under Test (SUT).* AutonomousVehicle, as well as other actors, extends the Physical class (which in turn subclasses Geometric). This means that these objects have a visual as well as a physical model. The visual model is essentially a textured 3D mesh. The physical model contains properties such as mass, moments of inertia of separate bodies in the vehicle, joints, etc. This is used by the physics simulator to compute the vehicle's motion in response to external forces and control input. In the following code segment, we instantiate and place our test vehicle on the road:

v = AutonomousVehicle (start:r.onLane(1, 0.1), model: CarAsset(...), controller:MyController(...))

The start parameter "places" the vehicle in the world (in relative coordinates). The model parameter provides the implementation of the geometric and physical model of the vehicle. The controller parameter implements the autonomous controller under test. The internals of the controller implementation are not important; what is important is its interface (sensor inputs and the actuator outputs). These determine the input and output streams that are passed to the controller during simulation. For example, a typical controller can take sensor streams such as image streams from a camera as input and produce throttle and steering angles as outputs. The Paracosm framework "wires" these streams appropriately. For example, the rendering engine determines the camera images based on the geometry of the scene and the position of the camera and the controller outputs are fed to the physics engine to determine the updated scene. Though simpler systems like openpilot [15] use only a dashboard-mounted camera, autonomous vehicles can, in general, mix cameras at various mount points, LiDARs, radars, and GPS. Paracosm can emulate many common types of sensors which produce streams of data. It is also possible to integrate new sensors, which are not supported out-of-the-box, by implementing them using the game engine's API.

*Other Actors.* A test often involves many actors such as pedestrians, and other (non-test) vehicles. Apart from the standard geometric (optionally physical) properties, these can also have some pre-programmed behavior. Behaviors can either be only dependent on the starting position (say, a car driving straight on the same lane), or be dynamic and reactive, depending on test parameters and behaviors of other actors. In general, the reactive nature of objects enables complex scenarios to be built. For example, here, we specify a simple behavior of a pedestrian crossing a road. The pedestrian starts crossing the road when a car is a certain distance away. In the code segments below, we use '\_' as shorthand for a lambda expression, i.e., "f(\_)" is the same as "x => f(x)".

```
Pedestrian(value start, value target, carPos, value dist, value speed)
    extends Geometric {
  ... // Initialization
  // Generate an event when the car gets close
  trigger = carPos.Filter(abs(_ - start) < dist)
  // Target location reached
  done = pos.Filter(_ == target)
  // Walk to the target after trigger fires
  tick.SkipUntil(trigger).TakeUntil(done).foreach(... /* walk with given speed */)
}
```
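The stream pipeline in the Pedestrian listing can be read operationally: ticks are ignored until the trigger fires, and walking stops once the target is reached. The following Python sketch replays that logic over a finite list of car positions, one per tick. It is a simplification under stated assumptions: positions are one-dimensional, the pedestrian crosses from offset 0 to `width`, walking begins in the tick in which the trigger fires, and the names `pedestrian_trace`, `ped_road_pos` and `width` are hypothetical.

```python
from typing import List


def pedestrian_trace(car_pos: List[float], ped_road_pos: float,
                     dist: float, width: float, speed: float) -> List[float]:
    """Return the pedestrian's crossing offset after each tick."""
    offset = 0.0
    triggered = False
    trace = []
    for car in car_pos:
        # trigger: the car gets closer than dist (SkipUntil)
        if not triggered and abs(car - ped_road_pos) < dist:
            triggered = True
        # walk until the target offset is reached (TakeUntil)
        if triggered and offset < width:
            offset = min(width, offset + speed)
        trace.append(offset)
    return trace


trace = pedestrian_trace([50.0, 40.0, 30.0, 20.0, 10.0, 0.0],
                         ped_road_pos=0.0, dist=25.0, width=4.0, speed=2.0)
print(trace)
```

With these inputs the pedestrian stays put for the first three ticks, starts walking when the car is 20 units away, and remains at the far side once the crossing is complete.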
*Monitors and Test Oracles.* Paracosm provides an API for writing qualitative and quantitative temporal specifications. For instance, in the following example, we check that there is no collision and ensure that a collision was not trivially avoided because our vehicle did not move at all.

```
// No collision
CollisionMonitor(AutonomousVehicle v) extends Monitor {
  assert(v.collider.IsEmpty())
}
// Cannot trivially pass the test by staying put
DistanceMonitor(AutonomousVehicle v, value minD) extends Monitor {
  pOld = v.pos.Take(1).Concat(v.pos)
  D = v.pos.Zip(pOld).Map(abs(_ - _)).Sum()
  assert(D >= minD)
}
```
The ability to write monitors which read streams of system-generated events provides an expressive framework to write temporal properties, something that has been identified as a major limitation of prior tools [31]. Monitors for metric and signal temporal logic specifications can be encoded in the usual way [18,33].
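The distance computation in DistanceMonitor can be mirrored on a finite list of positions. The Python sketch below assumes one-dimensional positions sampled once per tick; `distance_travelled` is a hypothetical helper name. It reproduces the `Take(1).Concat(pos)` trick, which delays the position stream by one element so that zipping it with the original stream pairs each position with its predecessor.

```python
from itertools import chain, islice
from typing import List


def distance_travelled(positions: List[float]) -> float:
    """Sum of |pos[t] - pos[t-1]| over all ticks."""
    # pOld = pos.Take(1).Concat(pos): the first element repeated, then pos
    p_old = chain(islice(positions, 1), positions)
    # Zip stops at the shorter stream, so the extra last element is ignored
    return sum(abs(p - q) for p, q in zip(positions, p_old))


total = distance_travelled([0.0, 1.0, 3.0, 3.0, 6.0])
print(total)
```

A monitor in this style would then assert that the total is at least the required minimum distance.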

### **3 Systematic Testing of Paracosm Worlds**

#### **3.1 Test Inputs and Coverage**

Worlds in Paracosm directly describe a parameterized family of tests. The testing framework allows users to specify various strategies to generate input streams for both static and dynamic reactive objects in the world.

*Test Cases.* A *test* of *duration* $T$ executes a configuration of reactive objects by providing inputs to every open input stream in the configuration for $T$ ticks. The inputs for each stream must satisfy const parameters and respect the range constraints from VarInterval and VarEnum. The runtime system manages the scheduling of inputs and pushing input streams to the reactive objects. Let $\mathit{In}$ denote the set of all input streams, and $\mathit{In} = \mathit{In}_D \cup \mathit{In}_C$ denote the partition of $\mathit{In}$ into *discrete* streams and *continuous* streams respectively. Discrete streams take their value over a finite, discrete range; for example, the color of a car, the number of lanes on a road segment, or the position of the next pedestrian (left/right) are discrete streams. Continuous streams take their values in a continuous (bounded) interval. For example, the fog density and the speed of a vehicle are continuous streams.

*Coverage.* In the setting of autonomous vehicle testing, one often wants to explore the state space of a parameterized world to check "how well" an autonomous vehicle works under various situations, both qualitatively and quantitatively. Thus, we now introduce a notion of coverage. Instead of structural coverage criteria such as line or branch coverage, our goal is to cover the parameter space. In the following, for simplicity of notation, we assume that all discrete streams take values from $\{0, 1\}$, and all continuous streams take values in the real interval $[0, 1]$. Any input stream over bounded intervals, discrete or continuous, can be encoded into such streams. For discrete streams, there are finitely many tests, since each co-ordinate is Boolean and there is a fixed number of co-ordinates. One can define the coverage as the ratio of the number of vectors tested to the total number of vectors. Unfortunately, the total number of vectors is very high: if each stream is constant, then there are already $2^n$ tests for $n$ streams. Instead, we consider the notion of $k$*-wise testing* from combinatorial testing [38]. In $k$-wise testing, we fix a parameter $k$, and ask that every interaction between every $k$ elements is tested. Let us be more precise. Suppose that a test vector has $N$ co-ordinates, where each co-ordinate can get the value 0 or 1. A set of tests $A$ is a $k$*-wise covering family* if for every subset $\{i_1, i_2, \ldots, i_k\} \subseteq \{1, \ldots, N\}$ of co-ordinates and every vector $v \in \{0, 1\}^k$, there is a test $t \in A$ whose restriction to $i_1, \ldots, i_k$ is precisely $v$.
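The definition of a k-wise covering family translates directly into a brute-force check. The Python sketch below is meant only to make the definition concrete; the function name `is_k_wise_covering` is an assumption, and the check enumerates all subsets of k co-ordinates, so it is practical only for small N.

```python
from itertools import combinations
from typing import Sequence, Tuple


def is_k_wise_covering(tests: Sequence[Tuple[int, ...]], n: int, k: int) -> bool:
    """Check that every choice of k co-ordinates sees all 2^k Boolean patterns."""
    for coords in combinations(range(n), k):
        # restrictions of all tests to the chosen co-ordinates
        seen = {tuple(t[i] for i in coords) for t in tests}
        if len(seen) < 2 ** k:
            return False
    return True


# Four tests suffice for 2-wise coverage of three Boolean co-ordinates,
# although there are 2^3 = 8 possible test vectors in total.
tests = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]
print(is_k_wise_covering(tests, n=3, k=2))
```

The example illustrates why k-wise coverage is attractive: the family covers every pair of co-ordinates with half of the exhaustive test set, yet it is not 3-wise covering.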

For continuous streams, the situation is more complex: since any continuous interval has infinitely many points, each corresponding to a different test case, we cannot directly define coverage as a ratio (the denominator will be infinite). Instead, we define coverage using the notion of *dispersion* [49, 57]. Intuitively, dispersion measures the largest empty space left by a set of tests. We assume a (continuous) test is a vector in $[0, 1]^N$: each entry is picked from the interval $[0, 1]$ and there are $N$ co-ordinates. Dispersion over $[0, 1]^N$ can be defined relative to sets of neighborhoods, such as $N$-dimensional balls or axis-parallel rectangles. Let us define $\mathcal{B}$ to be the family of $N$-dimensional axis-parallel rectangles in $[0, 1]^N$; our results also hold for other notions of neighborhoods such as balls or ellipsoids. For a neighborhood $B \in \mathcal{B}$, let $\mathrm{vol}(B)$ denote the volume of $B$. Given a set $A \subseteq [0, 1]^N$ of tests, we define the *dispersion* as the largest volume neighborhood in $\mathcal{B}$ without any test:

$$\mathsf{dispersion}(A) = \sup \left\{ \mathrm{vol}(B) \mid B \in \mathcal{B} \text{ and } A \cap B = \emptyset \right\}$$

A lower dispersion means better coverage.
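In one dimension the definition is easy to evaluate, since axis-parallel rectangles in $[0, 1]$ are intervals and the largest empty one is the widest gap between neighboring tests or between a test and an endpoint. The following Python sketch covers only this special case (N = 1); the function name `dispersion_1d` is an assumption, and the general N-dimensional computation is considerably harder.

```python
from typing import Iterable


def dispersion_1d(tests: Iterable[float]) -> float:
    """Length of the largest interval in [0, 1] that contains no test point."""
    points = [0.0] + sorted(tests) + [1.0]
    return max(b - a for a, b in zip(points, points[1:]))


print(dispersion_1d([0.25, 0.5, 0.75]))    # evenly spread tests
print(dispersion_1d([0.125, 0.25, 0.375])) # tests clustered near 0
```

Both sets contain three tests, but the clustered set leaves a much larger untested interval and therefore has higher dispersion.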

Let us summarize. Suppose that a test vector consists of $N_D$ discrete co-ordinates and $N_C$ continuous co-ordinates; that is, a test is a vector $(t_D, t_C)$ in $\{0, 1\}^{N_D} \times [0, 1]^{N_C}$. We say a set of tests $A$ is $(k, \varepsilon)$*-covering* if the discrete parts $t_D$ of the tests in $A$ form a $k$-wise covering family and the continuous parts $t_C$ have dispersion at most $\varepsilon$.


#### **3.2 Test Generation**

The goal of our default test generator is to maximize $(k, \varepsilon)$ for a programmer-specified number of test iterations or ticks.

*k-Wise Covering Family.* One can use explicit construction results from combinatorial testing to generate $k$-wise covering families [14]. However, a simple way to generate such families with high probability is random testing. The proof is by the probabilistic method [4] (see also [44]). Let $A$ be a set of $2^k (k \log N - \log \delta)$ uniformly randomly generated $\{0, 1\}^N$ vectors. Then $A$ is a $k$-wise covering family with probability at least $1 - \delta$.
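The bound above suggests a very simple generator: compute the required number of vectors and draw them uniformly at random. The Python sketch below does exactly that. It assumes the logarithms are natural logarithms, which the text does not state, and the function names are hypothetical.

```python
import math
import random
from typing import List, Tuple


def covering_family_size(k: int, n: int, delta: float) -> int:
    """Number of random vectors from the bound 2^k (k log N - log delta).

    Assumption: natural logarithms; the result is rounded up.
    """
    return math.ceil(2 ** k * (k * math.log(n) - math.log(delta)))


def random_tests(k: int, n: int, delta: float, seed: int = 0) -> List[Tuple[int, ...]]:
    """Draw that many uniformly random Boolean vectors of length n."""
    rng = random.Random(seed)
    size = covering_family_size(k, n, delta)
    return [tuple(rng.randint(0, 1) for _ in range(n)) for _ in range(size)]


family = random_tests(k=2, n=10, delta=0.05)
print(len(family))
```

For ten Boolean parameters, pairwise coverage with failure probability at most 0.05 needs only a few dozen random tests under this bound, compared with 1024 exhaustive ones.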

*Low Dispersion Sequences.* It is tempting to think that uniformly generating vectors from $[0, 1]^N$ would similarly give low dispersion sequences. Indeed, as the number of tests goes to infinity, the set of randomly generated tests has dispersion 0 almost surely. However, when we fix the number of tests, it is well known that uniform random sampling can lead to high dispersion [49, 57]; in fact, one can show that the dispersion of $n$ uniformly randomly generated tests grows asymptotically as $O((\log \log n / n)^{1/2})$ almost surely. Our test generation strategy is based on *deterministic quasi-Monte Carlo sequences*, which have much better dispersion properties, asymptotically of the order of $O(1/n)$, than the dispersion behavior of uniformly random tests. There are many different algorithms for generating quasi-Monte Carlo sequences deterministically (see, e.g., [49, 57]). We use *Halton sequences*. For a given $\varepsilon$, we need to generate $O(1/\varepsilon)$ inputs via Halton sampling. In Section 4.2, we compare uniform random and Halton sampling.
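A Halton sequence is built from the radical inverse function: the digits of the index in a given base are mirrored around the radix point, and each dimension uses a different prime base. The Python sketch below shows the standard construction for two dimensions with bases 2 and 3; it illustrates the technique in general and is not taken from the Paracosm implementation.

```python
from typing import List, Sequence, Tuple


def radical_inverse(index: int, base: int) -> float:
    """Mirror the base-`base` digits of `index` around the radix point."""
    result = 0.0
    fraction = 1.0 / base
    while index > 0:
        result += (index % base) * fraction
        index //= base
        fraction /= base
    return result


def halton(n_points: int, bases: Sequence[int] = (2, 3)) -> List[Tuple[float, ...]]:
    """First n_points of the Halton sequence, one prime base per dimension."""
    return [tuple(radical_inverse(i, b) for b in bases)
            for i in range(1, n_points + 1)]


for point in halton(4):
    print(point)
```

In base 2 the first co-ordinates are 1/2, 1/4, 3/4, 1/8, and so on: each new point falls into one of the largest remaining gaps, which is the intuition behind the low dispersion of such sequences.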

*Cost Functions and Local Search.* In many situations, testers want to optimize parameter values for a specific function. A simple example of this is finding higher-speed collisions, which, intuitively, can be found in the vicinity of test parameters that already result in high-speed collisions. A slightly different case is (greybox) fuzzing [5, 55], for example, finding new collisions using small mutations on parameter values that result in the vehicle narrowly avoiding a collision. Our test generator supports such *quantitative* objectives and *local search*. A quantitative monitor evaluates a cost function on a run of a test case. Our test generation tool generates an initial, randomly chosen, set of test inputs. Then, it considers the scores returned by the Monitor on these samples, and performs a local search on samples with the highest/lowest scores to find local optima of the cost function.
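One simple way to realize such a search is hill climbing by random mutation, sketched below in Python. This is an illustration under assumptions, not the search procedure of the tool: it maximizes the score, starts from the single best initial sample, mutates every co-ordinate by a small uniform perturbation, and clamps parameters to $[0, 1]$. The function name and its arguments are hypothetical, and the cost function stands in for the score a quantitative monitor would return after a simulation run.

```python
import random
from typing import Callable, List, Sequence, Tuple


def local_search(cost: Callable[[Sequence[float]], float],
                 initial: List[List[float]],
                 rounds: int = 50, step: float = 0.05,
                 seed: int = 0) -> Tuple[List[float], float]:
    """Hill climbing: keep a mutated sample only if it scores higher."""
    rng = random.Random(seed)
    best = max(initial, key=cost)          # start from the best initial sample
    best_score = cost(best)
    for _ in range(rounds):
        # small mutation of every parameter, clamped to [0, 1]
        candidate = [min(1.0, max(0.0, x + rng.uniform(-step, step)))
                     for x in best]
        score = cost(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score


def toy_cost(p: Sequence[float]) -> float:
    """Toy score with its maximum at (0.7, 0.3)."""
    return -((p[0] - 0.7) ** 2 + (p[1] - 0.3) ** 2)


samples = [[0.1, 0.1], [0.5, 0.5], [0.9, 0.2]]
best, score = local_search(toy_cost, samples)
print(best, score)
```

In a real setting each call to the cost function is a full simulation run, so the number of rounds is bounded by the test budget.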

#### **4 Implementation and Tests**

#### **4.1 Runtime System and Implementation**

Paracosm uses the Unity game engine [69] to render visuals, perform runtime checks, and simulate physics (via PhysX [16]). Reactive objects are built on top of UniRx [36], an implementation of the popular Reactive Extensions framework [56]. The game engine manages geometric transformations of 3D objects and offers easy-to-use abstractions for generating realistic simulations. Encoding of behaviors and monitors, management of 3D geometry, and dynamic checks are implemented using the game engine interface. The project code is available at https://gitlab.mpi-sws.org/mathur/paracosm.

A simulation in Paracosm proceeds as follows. A test configuration is specified as a subclass of the EnvironmentProgramBaseClass. Tests are run by invoking the run\_test method, which receives as input the reactive objects that should be instantiated in the world as well as additional parameters relating to the test. The run\_test method runs the tests by first initializing and placing the reactive objects in the scene using their 3D mesh (if they have one) and then invoking a reactive engine to start the simulation. The system under test is run in a separate process and connects to the simulation. The simulation then proceeds until the simulation completion criterion is met (a time-out or some monitor event).

*Output to Standardized Testing Formats.* There have been recent efforts to create standardized descriptions of tests in the automotive industry. The most relevant formats are OpenDRIVE [7] and OpenSCENARIO (only recently finalized) [8]. OpenDRIVE describes road structures, and OpenSCENARIO describes actors and their behavior. Paracosm currently supports outputs to OpenDRIVE. Due to the static nature of the specification format, a different file is generated for each test iteration/configuration.

#### **4.2 Evaluation**

We evaluate Paracosm with respect to the following research questions (**RQ**s):

**RQ 1**: Does Paracosm's programmatic interface enable the easy design of test environments and worlds?

**RQ 2**: Do the test input generation strategies discussed in Section 3 effectively explore the parameter space?

**RQ 3**: Can Paracosm help uncover poor performance or bad behavior of the SUT in common autonomous driving tasks?

*Methodology.* To answer **RQ 1**, we develop three independent environments rich with visual features and other actors, and use the variety generated with just a few lines of code as a proxy for ease of design. To answer **RQ 2**, we use coverage-maximizing strategies for test inputs in all three environments/case studies. We also use and evaluate cost functions and local search based methods. To answer **RQ 3**, we test various neural network based systems and demonstrate how Paracosm can help uncover problematic scenarios. A summary of the case studies presented here is available in Table 1. In our Technical Report [43], we present more case studies, specifically experiments on many pre-trained neural networks, busy urban environments and studies exploiting specific testing features of Paracosm.

Table 1: An overview of our case studies. Note that even though the Adaptive Cruise Control study has 2 discrete parameters, we calculate k-wise coverage for 3 as the 2 parameters require 3 bits for representation.

(a) A good test with all parameter values same as the training set (true positive: 89%, false positive: 0%).

(b) A bad test with all parameter values different from the training set (true positive: 9%, false positive: 1%).

Fig. 3: Example results from the road segmentation case study. Pixels with a green mask are segmented by the SUT as a road.

### **4.3 Case Studies**

*Road segmentation.* Using Paracosm's programmatic interface, we design a long road segment with several vehicles. The vehicular behavior is to drive on their respective lanes with a fixed maximum velocity. The test parameters are the number of lanes ({2, 4}), number of cars in the environment ({0, 5}) and light conditions ({Noon, Evening}). Noon lighting is much brighter than the evening. The direction of lighting is also the opposite. We test a deep CNN called VGGNet [62], which is known to perform well on several image segmentation benchmarks. The task is road segmentation, i.e., given a camera image, identifying which pixels correspond to the road. The network is trained on 191 dashcam images captured in the test environment with fixed parameters (2 lanes, 5 cars, and Noon lighting), recorded at the rate of one image every 1/10th of a second, while manually driving the vehicle around (using a keyboard). We test on 100 images generated using Paracosm's default test generation strategy (uniform random sampling for discrete parameters). Table 2 summarizes the test results. Tests with parameter values far from the training set perform noticeably worse. As depicted in Figure 3, this happens because varying test parameters can drastically change the scene.

Table 2: Summary of results of the road segmentation case study. Each combination of parameter values is presented separately, with the parameter values used for training in bold. We report the SUT's average true positive rate (% of pixels corresponding to the road that are correctly classified) and false positive rate (% of pixels that are not road, but incorrectly classified as road).

Table 3: Results for the jaywalking pedestrian case study.
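The true positive and false positive rates reported for the road segmentation study (Table 2) follow directly from a predicted mask and a ground-truth mask. The sketch below assumes both masks are given as equally sized nested lists of 0/1 values; the function name and this representation are illustrative choices, not part of Paracosm.

```python
def segmentation_rates(predicted, truth):
    """Return (true positive %, false positive %) for binary road masks."""
    true_pos = false_pos = road = non_road = 0
    for pred_row, truth_row in zip(predicted, truth):
        for pred, is_road in zip(pred_row, truth_row):
            if is_road:
                road += 1
                true_pos += int(bool(pred))   # road pixel labelled as road
            else:
                non_road += 1
                false_pos += int(bool(pred))  # non-road pixel labelled as road
    tpr = 100.0 * true_pos / road if road else 0.0
    fpr = 100.0 * false_pos / non_road if non_road else 0.0
    return tpr, fpr
```

Averaging these per-image rates over all test images with the same parameter combination gives one row of a table such as Table 2.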

*Jaywalking pedestrian.* We now test over the environment presented in Section 2. The environment consists of a straight road segment and a pedestrian. The pedestrian's behavior is to cross the road at a specific walking speed when the autonomous vehicle is a specific distance away. The walking speed of the pedestrian and the distance of the autonomous vehicle when the pedestrian starts crossing the road are test parameters. The SUT is a CNN based on NVIDIA's behavioral cloning framework [12]. It takes camera images as input, and produces the relevant steering angle or throttle control as output. The SUT is trained on 403 samples obtained by driving the vehicle manually and recording the camera and corresponding control data. The training environment has pedestrians crossing the road at various time delays, but always at a fixed walking speed (1 m/s). In order to evaluate **RQ 2** completely, we evaluate the default coverage maximizing sampling approach, as well as explore two quantitative objectives: first, maximizing the collision speed, and second, finding new failing cases around samples that *almost* fail. For the default approach, the CollisionMonitor as presented in Section 2 is used. For the first quantitative objective, this CollisionMonitor's code is prepended with the following calculation:

```
// Score is speed of car at time of collision
coll_speed = v.speed.CombineLatest(v.collider, (s, c) => s)
   .First()
```
The score coll\_speed is used by the test generator for optimization. For the second quantitative objective, the CollisionMonitor is modified to give high scores to tests where the distance between the autonomous vehicle and pedestrian is very small:

```
CollisionMonitor(AutonomousVehicle v, Pedestrian p)
   extends Monitor {
    minDist = v.pos.Zip(p.pos).Map(1/abs(_-_)).Min()
    coll_score = v.collider.Map(0)
    // Score is either 0 (collision) or 1/minDist
    score = coll_score.DefaultIfEmpty(minDist)
    assert(v.collider.IsEmpty())
}
```
```
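As a plain, non-reactive reading of this near-miss objective, the Python sketch below follows the comment in the listing: the score is 0 when a collision occurred and otherwise the reciprocal of the smallest vehicle-pedestrian distance observed. The function name, the representation of trajectories as lists of 2D points, and the handling of a zero distance are assumptions made for illustration.

```python
import math


def near_miss_score(vehicle_path, pedestrian_path, collided: bool) -> float:
    """Score a run: 0 on collision, else 1 / (closest approach distance)."""
    if collided:
        return 0.0
    # Pair up positions sampled at the same time steps (like Zip on streams).
    min_dist = min(
        math.dist(v, p) for v, p in zip(vehicle_path, pedestrian_path)
    )
    # Assumption: treat a zero distance without a collision event as maximal.
    return math.inf if min_dist == 0 else 1.0 / min_dist
```

Runs in which the vehicle passes close to the pedestrian without hitting them receive high scores, which is what makes them attractive starting points for the local search.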
We evaluate the following test input generation strategies: (i) Random sampling, (ii) Halton sampling, and (iii) Random or Halton sampling with local search for the two quantitative objectives. We run 100 iterations of each strategy with a 15-second timeout. For random or Halton sampling, we sample 100 times. For the quantitative objectives, we first generate 85 random or Halton samples, then choose the top 5 scores, and finally run 3 simulated annealing iterations on each of these 5 configurations (85 + 5 × 3 = 100 runs in total). Table 3 presents results from the various test input generation strategies. Halton sampling offers the lowest dispersion (highest coverage) over the parameter space. This can also be visually confirmed from the plot of test parameters (Figure 4): there are no big gaps in the parameter space. Moreover, we find that test strategies optimizing for the first objective are successful in finding more collisions with higher speeds. As these techniques perform simulated annealing repetitions on top of already failing tests, they also find more failing tests overall. Finally, test strategies using the second objective are also successful in finding more (newer) failure cases than simple Random or Halton sampling.
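The sample-then-refine budget described above (initial samples, top-scoring configurations, a few annealing steps each) can be sketched as follows. This is a simplified illustration under stated assumptions: `simulate` stands for one simulation run that returns a score to maximize, and the Gaussian step size, the initial temperature, and the halving cooling schedule are hypothetical choices rather than values from the paper.

```python
import math
import random


def sample_then_search(simulate, initial_samples, bounds, rng,
                       top=5, steps=3, step_frac=0.05, temperature=1.0):
    """Score the initial samples, then anneal around the `top` best ones."""
    runs = [(simulate(x), x) for x in initial_samples]
    best = sorted(runs, key=lambda sx: sx[0], reverse=True)[:top]
    for score, point in best:
        temp = temperature
        for _ in range(steps):
            # Propose a nearby configuration, clipped to the parameter ranges.
            candidate = tuple(
                min(hi, max(lo, x + rng.gauss(0.0, step_frac * (hi - lo))))
                for x, (lo, hi) in zip(point, bounds)
            )
            cand_score = simulate(candidate)
            runs.append((cand_score, candidate))
            # Always accept improvements; accept worse moves with a
            # probability that shrinks as the temperature cools.
            worse_ok = rng.random() < math.exp((cand_score - score) / temp)
            if cand_score >= score or worse_ok:
                score, point = cand_score, candidate
            temp *= 0.5
    return runs
```

With 85 initial samples and the defaults `top=5` and `steps=3`, the function performs exactly 100 simulation runs, matching the budget used for the plain sampling strategies.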

Fig. 4: A comparison of the various test generation strategies for the jaywalking pedestrian case study. The X-axis is the walking speed of the pedestrian (2 to 10 m/s). The Y-axis is the distance from the car when the pedestrian starts crossing (30 to 60 m). Passing tests are labelled with a green dot. Failing tests (tests with a collision) are marked with a red cross.

*Adaptive Cruise Control.* We now create and test in an environment with our test vehicle following a car (lead car) on the same lane. The lead car's behavior is programmed to drive on the same lane as the test vehicle, with a certain maximum speed. This is a very typical driving scenario that engineers test their implementations on. We use 5 test parameters: the initial lead of the lead car to the test vehicle ([8 m, 40 m]), the lead car's maximum speed ([3 m/s, 8 m/s]), density of fog<sup>3</sup> in the environment ([0, 1]), number of lanes on the road ({2, 4}), and color of the lead car ({Black, Red, Yellow, Blue}). We use both CollisionMonitor<sup>4</sup> and DistanceMonitor, as presented in Section 2. A test *passes* if there is no collision and the autonomous vehicle moves at least 5 m during the simulation duration (15 s).

We use Paracosm's default test generation strategy, i.e., Halton sampling for continuous parameters and Random sampling for discrete parameters (no optimization or fuzzing). The SUT is the same CNN as in the previous case study. It is trained on 1034 training samples, which are obtained by manually driving behind a red lead car on the same lane of a 2-lane road with the same maximum velocity (5.5 m/s) and no fog.

The results of this case study are presented in Table 4. Looking at the discrete parameters, the number of lanes does not seem to contribute towards a risk of collision. Surprisingly, though the training only involves a Red lead car, the results appear to be the best for a Blue lead car. Moving on to the continuous parameters, the fog density appears to have the most significant impact on test failures (collision or vehicle inactivity). In the presence of dense fog, the SUT behaves pessimistically and does not accelerate much (thereby causing a failure due to inactivity). These are all interesting and useful metrics about the performance of our SUT. Plots of the results projected on to continuous parameters are presented in Figure 5.

<sup>3</sup> 0 denotes no fog and 1 denotes very dense fog (exponential squared scale).

<sup>4</sup> The monitor additionally calculates the mean distance of the test vehicle to the lead car during the test, which is used for later analysis.

Fig. 5: Continuous test parameters of the Adaptive Cruise Control study plotted against each other: the initial offset of the lead car (8 to 40 m), the lead car's maximum speed (3 to 8 m/s) and the fog density (0 to 1). Green dots, red crosses, and blue triangles denote passing tests, collisions, and inactivity respectively.

Table 4: Parameterized test on Adaptive Cruise Control, separated for each value of discrete parameters, and low and high values of continuous parameters. A test *passes* if there are no collisions and no inactivity (the overall distance moved by the test vehicle is more than 5 m). The average offset (in m) maintained by the test vehicle to the lead car (for passing tests) is also presented.

#### **4.4 Results and Analysis**

We now summarize the results of our evaluation with respect to our **RQ**s:

**RQ 1**: All three case studies involve varied, rich and dynamic environments. They are representative of tests engineers would typically want to do, and we parameterize many different aspects of the world and the dynamic behavior of its components. Each of these designs takes at most 70 lines of code. This provides confidence in Paracosm's ability to provide an easy interface for the design of realistic test environments.

**RQ 2**: Our default test generation strategies are found to be quite effective at exploring the parameter space systematically, eliminating large unexplored gaps, and at the same time, successfully identifying problematic cases in all the three case studies. The jaywalking pedestrian study demonstrates that optimization and local search are possible on top of these strategies, and are quite effective in finding the relevant scenarios. The adaptive cruise control study tests over 5 parameters, which is more than most related works, and even guarantees good coverage of this parameter space. Therefore, it is amply clear that Paracosm's test input generation methods are useful and effective.

**RQ 3**: The road segmentation case study uses a well-performing neural network for object segmentation, and we are able to detect degraded performance for automatically generated test inputs. Whereas this study focuses on static image classification, the next two, i.e., the jaywalking pedestrian and the adaptive cruise control studies, uncover poor performance on simulated driving, using a popular neural network architecture for self-driving cars. Therefore, we conclude that Paracosm can find bugs in different kinds of systems related to autonomous driving.

#### **4.5 Threats to Validity**

The *internal validity* of our experiments depends on having implemented our system correctly and, more importantly, trained and used the neural networks considered in the case studies correctly. For training the networks, we followed the available documentation and inspected our examples to ensure that we use an appropriate training procedure. We watched some test runs and replays of tests we did not understand. Furthermore, our implementation logs events and we also capture images, which allow us to check a large number of tests.

In terms of threats to external validity, the biggest challenge in this project has been finding systems that we can easily train and test in complex driving scenarios. Publicly available systems have limited capabilities and tend to be brittle. Many networks trained on real world data do not work well in simulation. We therefore re-train these networks in simulation. An alternative is to run fewer tests, but use more expensive and visually realistic simulations. Our test generation strategy maximizes coverage, even when only a few test iterations can be performed due to high simulation cost.

#### **5 Related Work**

Traditionally, test-driven software development paradigms [9] have advocated testing and mocking frameworks to test software early and often. Mocking frameworks and mock objects [42,47] allow programmers to test a piece of code against an API specification. Typically, mock objects are stubs providing outputs to explicitly provided lists of inputs of simple types, with little functionality of the actual code. Thus, they fall short of providing a rich environment for autonomous driving. Paracosm can be seen as a mocking framework for reactive, physical systems embedded in the 3D world. Our notion of constraining streams is inspired by work on declarative mocking [59].

*Testing Cyber-Physical Systems.* There is a large body of work on automated test generation tools for cyber-physical systems through heuristic search of a high-dimensional continuous state space. While much of this work has focused on low-level controller interfaces [6,17,19,20,25,60] rather than the system level, specification and test generation techniques arising from this work—for example, the use of metric and signal temporal logics or search heuristics—can be adapted to our setting. More recently, test generation tools have started targeting autonomous systems under a simulation-based semantic testing framework similar to ours. In most of these works, visual scenarios are either fixed by hand [1, 2, 10, 22, 27, 29, 66, 67], or are constrained due to the model or coverage criteria [3, 45, 50]. These analyses are shown to be preferable to the application of random noise on the input vector. Additionally, a simulation-based approach filters benign misclassifications from misclassifications that actually lead to bad or dangerous behavior. Our work extends this line of work and provides an expressive language to design parameterized environments and tests. AsFault [29] uses random search and mutation for procedural generation of road networks for testing. AC3R [28] reconstructs test cases from accident reports.

To address problems of high time and infrastructure cost of testing autonomous systems, several simulators have been developed. The most popular is Gazebo [26] for the ROS [54] robotics framework. It offers a modular and extensible architecture, but falls behind on visual realism and complexity of environments that can be generated with it. To counter this, game engines are used. Popular examples are TORCS [72], CARLA [21], and AirSim [61]. Modern game engines support creation of realistic urban environments. Though they enable visually realistic simulations, and enable detection of infractions such as collisions, the environments themselves are difficult to design. Designing a custom environment involves manual placement of road segments, buildings, and actors (as well as their properties). Performing many systematic tests is therefore time-consuming and difficult. While these systems and Paracosm share the same aims and much of the same infrastructure, Paracosm focuses on procedural design and systematic testing, backed by relevant coverage criteria.

*Adversarial Testing.* Adversarial examples for neural networks [32,64] introduce perturbations to inputs that cause a classifier to classify "perceptually identical" inputs differently. Much work has focused on finding adversarial examples in the context of autonomous driving as well as on training a network to be robust to perturbations [11,30,46,51,71]. Tools such as DeepXplore [52], DeepTest [65], DeepGauge [41], and SADL [37] define a notion of coverage for neural networks based on the number of neurons activated during tests compared against the total number of neurons in the network and activation during training. However, these techniques focus mostly on individual classification tasks and apply 2D transformations on images. In comparison, we consider the closed-loop behavior of the system and our parameters directly change the world rather than apply transformations post facto. We can observe, over time, that certain vehicles are not detected, which is more useful to testers than a single misclassification [31]. Furthermore, it is already known that structural coverage criteria may not be an effective strategy for finding errors in classification [39]. We use coverage metrics on the test space, rather than the structure of the neural network. Alternately, there are recent techniques to verify controllers implemented as neural networks through constraint solving or abstract interpretation [24, 30, 34, 58, 71]. While these tools do not focus on the problem of autonomous driving, their underlying techniques can be combined in the test generation phase for Paracosm.

#### **6 Future Work and Conclusion**

Deploying autonomous systems like self-driving cars in urban environments raises several safety challenges. The complex software stack processes sensor data, builds a semantic model of the surrounding world, makes decisions, plans trajectories, and controls the car. The end-to-end testing of such systems requires the creation and simulation of whole worlds, with different tests representing different world and parameter configurations. Paracosm tackles these problems by (i) enabling procedural construction of diverse scenarios, with precise control over elements like road layout, physical and visual properties of objects, and behaviors of actors in the system, and (ii) using quasi-random testing to obtain good coverage over large parameter spaces.

In our evaluation, we show that Paracosm enables easy design of environments and automated testing of autonomous agents implemented using neural networks. While finding errors in sensing can be done with only a few static images, we show that Paracosm also enables the creation of longer test scenarios which exercise the controller's feedback on the environment. Our case studies focused on *qualitative* state space exploration. In future work, we shall perform *quantitative* statistical analysis to understand the sensitivity of autonomous vehicle behavior on individual parameters.

In the future, we plan to extend Paracosm's testing infrastructure to also aid in the training of deep neural networks that require large amounts of high quality training data. For instance, we show that small variations in the environment result in widely different results for road segmentation. Generating data is a time consuming and expensive task. Paracosm can easily generate labelled data for static images. For driving scenarios, we can record a user manually driving in a parameterized Paracosm environment and augment this data by varying parameters that should not impact the car's behavior. For instance, we can vary the color of other cars, positions of pedestrians who are not crossing, or even the light conditions and sensor properties (within reasonable limits).

**Acknowledgements** This research was funded in part by the Deutsche Forschungsgemeinschaft project 389792660-TRR 248 and by the European Research Council under the Grant Agreement 610150 (ERC Synergy Grant ImPACT).

# **References**



#### **Compositional Analysis of Probabilistic Timed Graph Transformation Systems**

Maria Maximova, Sven Schneider, and Holger Giese

University of Potsdam, Hasso Plattner Institute, Potsdam, Germany {maria.maximova,sven.schneider,holger.giese}@hpi.de

**Abstract.** The analysis of behavioral models is of high importance for cyber-physical systems, as the systems often encompass complex behavior based on e.g. concurrent components with mutual exclusion or probabilistic failures on demand. The rule-based formalism of probabilistic timed graph transformation systems (PTGTSs) is a suitable choice when the models representing states of the system can be understood as graphs and timed and probabilistic behavior is important. However, model checking PTGTSs is limited to systems with rather small state spaces.

We present an approach for the analysis of large-scale systems modeled as probabilistic timed graph transformation systems by systematically decomposing their state spaces into manageable fragments. To obtain qualitative and quantitative analysis results for a large-scale system, we verify that results obtained for its fragments serve as overapproximations for the corresponding results of the large-scale system. Hence, our approach allows for the detection of violations of qualitative and quantitative safety properties for the large-scale system under analysis. We consider a running example in which we model shuttles driving on tracks of a large-scale topology and for which we verify that shuttles never collide and are unlikely to execute emergency brakes. In our evaluation, we apply an implementation of our approach to the running example.

**Keywords:** cyber-physical systems, graph transformation systems, qualitative analysis, quantitative analysis, probabilistic timed systems, compositional analysis, model checking

# **1 Introduction**

Real-time cyber-physical systems often exhibit complex behavior based on e.g. concurrent components with mutual exclusion or probabilistic failures on demand. Consequently, modeling formalisms for capturing such systems must suitably support the modeling of their complex behaviors. In such a model-driven approach, the analysis of behavioral models w.r.t. a provided specification is vital to ensure overall soundness of the resulting system.

Funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) - 241885098, 148420506.

© The Author(s) 2021

E. Guerra and M. Stoelinga (Eds.): FASE 2021, LNCS 12649, pp. 196–217, 2021.

https://doi.org/10.1007/978-3-030-71500-7\_10

The rule-based transformation of graphs is a suitable choice when the models representing states of the system can be understood as graphs. In particular, the formalism of probabilistic timed graph transformation systems (PTGTSs) extends the standard rule-based transformation of graphs such that timed and probabilistic behavior is covered by supporting (a) non-deterministic choice among steps, (b) probabilistic choice among step results, and (c) steps representing the passage of time.

A model checking approach for PTGTSs w.r.t. probabilistic metric temporal properties was introduced in [19]. However, this model checking approach is also limited to systems with rather small state spaces due to the state space explosion problem. As a workaround, a selected set of small examples, hopefully capturing all system-specific challenges, may be considered to establish trust that the model exhibits the required safe behavior and that unwanted behavior is sufficiently unlikely. However, it cannot be excluded that the considered small examples fail to reveal all the threatening behavior.

We present a decomposition-based approach for the analysis of large-scale systems modeled as PTGTSs to rule out violations of qualitative and quantitative safety properties.

As a first step, we capture the underlying static large-scale topology (short LST) of a large-scale system as a subgraph that is not changed by graph transformation, describe how a fragment topology (short FT) can be embedded into such an LST (see the left part of Figure 1), and specify how multiple such embeddings of FTs can overlap in their borders (see the right part of Figure 1).

As a second step, based on the decomposition described by such embeddings, we construct for each FT an adapted PTGTS. Such an adapted PTGTS is then ensured to (a) exhibit the same behavior on the non-overlapped part of the FT (named *core*) and to (b) simulate all possible behaviors that can happen for any occurrence of the FT in an LST. To obtain the mentioned simulation, we include modifications of the rules of the original PTGTS operating on the border of an FT into the adapted PTGTS. With this direct relationship between behaviors on the FTs and the LST, we obtain that the likelihood of an unwanted or forbidden graph pattern in one of the adapted PTGTSs is an upper bound for its likelihood in its embedding in the large-scale PTGTS.

As a last step, exploiting our decomposition to counter the state space explosion problem, we apply the model checking approach from [19] to the PTGTSs constructed for the FTs employing its reduction to probabilistic timed automata (PTA) instead of applying the model checking approach directly to the PTGTS modeling the large-scale system.

To illustrate our approach, we consider a running example in which we model shuttles driving on tracks of an LST and for which we verify that shuttles never collide and are unlikely to execute emergency brakes. In our evaluation, we apply an implementation of our approach to the running example.

The idea to decompose a system into subsystems or to compose it from subsystems for the analysis has been studied intensively [25] but our suggested compositional approach has distinguishing characteristics. Firstly, the vast majority of approaches (like process algebras or similar models) assume that the modeling formalism supports the composition/decomposition as a first class concept such that compositional analysis techniques are directly applicable as the subsystem models cover all possible behaviors in all contexts. In contrast, we do not rely on a built-in decomposition operator but rather allow for a flexible derivation of an LST decomposition in terms of FTs, overlappings, and a suitable overapproximation on the border, which are not predefined by the modeling formalism.

Secondly, several approaches rely on a protocol-like specification of how the decomposed subsystems interact, while in our approach the overapproximation is derived systematically from the PTGTS model that does not necessarily provide such a protocol-like specification already. The compositional analysis approach for graph transformation systems (GTSs) from [24, 11] defines explicit interfaces, which are used to consider whether the behavior of two independent graphs glued via these interfaces (requiring that local transitions are compatible) cover jointly all global transitions. Moreover, in further approaches, protocols for the roles of collaborations and ports of components have been assumed. For example, in [14], the idea to overapproximate the environment and border is explored for timed automata with explicit models of the roles in form of protocol automata. This idea has been combined with dynamic collaborations in [12, 13] captured by timed GTSs (TGTSs) and their analysis via inductive invariant checking [3, 4]. Later on, this approach has been extended to role, component, and collaboration behavior, which is captured by TGTSs and hybrid GTSs in [5] and [2], respectively. However, as opposed to the presented approach, in all these cases an explicit concept of interface is assumed to separate parts that are analyzed in isolation.

This paper is structured as follows. In section 2, we introduce our running example from the domain of cyber-physical systems. In section 3, we recapitulate the necessary preliminaries related to PTA and PTGTSs also presenting the modeling of our running example. In section 4, we discuss the decomposition of static substructures of large-scale systems. In section 5, we present our decomposition-based approach allowing to split the model checking problem into more manageable parts. In section 6, we present an evaluation of the conceptual results for our running example. Finally, in section 7, we close the paper with a conclusion and an outlook on planned future work.

#### **2 Running Example**

We now informally introduce a scenario (based on the RailCab project [23]) of autonomous shuttles driving on an LST, which serves as a running example in the remainder of this paper. Based on this introduction, we will discuss how we model this shuttle scenario as a PTGTS in the next section.

In the considered shuttle scenario, a track topology containing a large number of tracks of approximately equal length is given. Tracks are connected to adjacent tracks via directed connections, thereby forming track sequences. Two track sequences can be joined together (i.e., can end up in a common track with two predecessors) leading to a *join* fragment topology (see FT<sub>8</sub> in Figure 4a) or can split up from a common track (i.e., a common track then has two successor tracks) leading to a *fork* fragment topology (see FT<sub>7</sub> in Figure 4a). Moreover, depots may have a directed connection to a track allowing shuttles to enter or exit the track topology. Shuttles, which are always located on a single track, may be in mode *DRIVE*, *STOP*, or *BRAKE*. Being in mode *DRIVE*, shuttles drive to the next track (respecting the direction of the connection between the tracks) with a certain velocity, which may be slow ([3, 4] time units per track) or fast ([2, 3] time units per track). Regularly, shuttles change into mode *STOP*, which allows them to avoid coming too close to other shuttles. Moreover, shuttles should slow down before entering a track with a construction site on it. However, shuttles noticing the construction site too late have to execute an emergency brake, thereby changing into the mode *BRAKE*. To reduce the likelihood of such emergency brakes, yellow traffic lights are installed a few tracks ahead of such construction sites to indicate to shuttles that they should slow down. After construction sites, green traffic lights may be installed permitting shuttles to increase their velocity. However, we also consider failures on demand where a traffic light that is passed by a shuttle is not recognized or, for some other reason, not appropriately taken into account by the shuttle. We assume a failure probability of 10<sup>−6</sup> for this case, assuming that the failure depends not only on the visual observation by the train driver but also on a failure of the backup system.

In our running example, *static* elements are the tracks, depots, installed traffic lights, and construction sites as well as connections between these elements. The PTGTS modeling the behavior of the described scenario never changes this underlying LST. Complementarily, *dynamic* elements are shuttles, their attributes, their connections to tracks of the LST as well as the attributes of traffic lights. Note that we later use a grammar to generate admissible LSTs.

For the considered shuttle scenario, we are interested in various properties. Firstly, we need to verify that the behavior of the system never gets temporally stuck in a state where no steps (discrete steps of e.g. driving shuttles or timed steps) are enabled. Secondly, we need to verify whether the rules have been constructed in a way ensuring the absence of collisions between shuttles (i.e., two shuttles should not be on a common track). Thirdly, emergency brakes should be improbable at a local level for a single shuttle but also at the global level for the entire LST and its possibly large number of shuttles.

(g) The rule *ConstructionSiteBrake*: a shuttle with high velocity ([2, 3] time units per track where only the lower end of the interval is stored in the graph) needs to execute an emergency brake to ensure that the track with a construction site on it is not entered with a too high velocity.

Fig. 2: Details for our running example, DPO diagram, and PTA example.

and *DriveExit2* for fragment topologies where parts of the application condition of the rule *Drive* are omitted due to the overlay specification of the running example.

# **3 Preliminaries**

We now briefly introduce the subsequently required details for graph transformation systems (GTSs) [10], probabilistic timed automata (PTA) [17], and probabilistic timed graph transformation systems (PTGTSs) [18, 19] in our notation. Along the way, we also discuss the modeling details for our running example from the previous section.

We employ type graphs (cf. [10]) such as the type graph *TG* from Figure 2a for our running example. A type graph describes the set of all admissible (typed attributed) graphs by mentioning the allowed types of nodes, edges, and attributes. We assume typed attributed graphs in which attributes are specified using a many-sorted first-order attribute logic as proposed in [21] (the attribute constraint ⊥ (false) in *TG* means that the type graph does not restrict attribute values). This approach to attribution has been used to capture constraints on attributes in graph conditions in [27] and to describe attribute modifications in [22, 28].

Graph transformation is then performed by applying a graph transformation rule (short rule) *ρ* = (*ℓ* : *K* ↪ *L*, *r* : *K* ↪ *R*) consisting of two monomorphisms (i.e., all components of the morphisms are injective). The rule specifies that the graph elements in *L* − *ℓ*(*K*) are to be deleted, the graph elements in *K* are to be preserved, and the graph elements in *R* − *r*(*K*) are to be added during graph transformation. Such a rule is applied to a graph *G* for a given match *m* : *L* ↪ *G* resulting in a graph *G*″ by constructing the double pushout (DPO) diagram (see Figure 2c) where the first and the second pushout squares describe the removal and the addition of graph elements specified in the rule, respectively. Moreover, a rule may additionally contain an application condition *φ* (denoted by *ρ* = (*ℓ*, *r*, *φ*)) to rule out certain matches specifying e.g. graph elements that may not be connected to graph elements matched by *m*. For further details on the graph transformation approach, we refer to [10].

PTA [17] combine the use of clocks to capture real-time phenomena and probabilism to approximate/describe the likelihood of outcomes of certain steps. A PTA such as the one in Figure 2d consists of (a) a set of locations with a distinguished initial location such as *ℓ*<sub>0</sub>, (b) a set of clocks such as *c*<sub>0</sub> (which are initially set to 0), (c) an assignment of a set of atomic propositions (APs) such as {*done*} to each location (for subsequent analysis of e.g. reachability properties), (d) an assignment of constraints on its clocks to each location as invariants such as *c*<sub>0</sub> ≤ 3, and (e) a set of probabilistic timed edges each consisting of (e1) a single source location, (e2) at least one target location, (e3) a clock constraint such as *c*<sub>0</sub> ≥ 2 specifying as a guard when the edge is enabled based on the current values of the clocks, (e4) for each target location a probability such as 0.5 that this target is reached (the probabilities for the target locations of the edge must add up to 1 as a probability distribution is required), and (e5) for each target location a set of clocks such as {*c*<sub>0</sub>} to be reset to 0 when that target location is reached.
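The components (e1) to (e5) of a probabilistic timed edge can be written down as a small data structure. The following is only a minimal sketch under simplifying assumptions: guards are restricted to lower bounds on clocks, the class and field names (`Branch`, `Edge`, `enabled`) are hypothetical, and the second target of the example edge is an assumption, since Figure 2d is not reproduced here.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class Branch:
    target: str             # (e2) target location
    probability: Fraction   # (e4) probability of reaching this target
    resets: frozenset       # (e5) clocks reset to 0 on reaching the target


@dataclass(frozen=True)
class Edge:
    source: str             # (e1) single source location
    guard: dict             # (e3) simplified guard: clock -> lower bound
    branches: tuple         # nonempty tuple of Branch

    def __post_init__(self):
        if not self.branches:
            raise ValueError("an edge needs at least one target location")
        if sum(b.probability for b in self.branches) != 1:
            raise ValueError("branch probabilities must add up to 1")

    def enabled(self, valuation):
        """Return True if every clock meets its lower bound in the guard."""
        return all(valuation[c] >= lo for c, lo in self.guard.items())


half = Fraction(1, 2)
edge = Edge(
    source="l1",
    guard={"c0": 2},  # models the guard c0 >= 2
    branches=(
        Branch("l2", half, frozenset({"c0"})),
        Branch("l1", half, frozenset({"c0"})),  # assumed second target
    ),
)
```

Exact fractions are used so that the requirement that the probabilities add up to 1 can be checked without floating-point tolerance.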

States of a PTA are given by pairs (*ℓ*, *v*) where *ℓ* is a location and *v* is the variable valuation mapping each clock of the PTA to a real number. Nondeterminism arises in PTA since a step for advancing time as well as multiple steps applying rules may be enabled in a single state. The logic PTCTL [17] then allows specifying properties such as "what is the worst-case probability that the PTA reaches a location labeled with the AP *done* within 5 time units", which can be analyzed by the PRISM model checker [16]. For the example PTA from Figure 2d, the given condition is satisfied with probability 0.75 since the nondeterminism of the PTA would be resolved (by a so-called adversary) such that the PTA first takes a step to *ℓ*<sub>1</sub> without letting time pass and then performs the probabilistic step (up to two times after waiting for not longer than 2 time units) until it reaches the location *ℓ*<sub>2</sub> labeled with the AP *done* (the probabilistic step cannot be taken a third time due to the requirement of at most 5 time units in the quoted property above).
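The value 0.75 can be retraced by a short calculation. It assumes, as the description of Figure 2d suggests, that each probabilistic step reaches the *done*-labeled location with probability 0.5 and otherwise allows a retry, and that at most two attempts fit into the time bound:

$$
\Pr(\text{reach } \mathit{done} \text{ within 5 time units})
  = \underbrace{0.5}_{\text{first attempt}}
  + \underbrace{0.5 \cdot 0.5}_{\text{second attempt}}
  = 1 - 0.5^{2}
  = 0.75
$$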

PTGTSs have been introduced in [18, 19] as a probabilistic real-time extension of GTSs. It has been shown that PTGTSs can be translated to PTA and, hence, PTGTSs can be understood as a high-level language for PTA as discussed below in more detail and can be analyzed using PRISM as well.

Similarly to PTA, a PTGTS state is given by a pair (*G*, *v*) of a graph and a clock valuation. The initial state is given by a distinguished initial graph and a valuation setting all clocks to 0. In our running example, each attribute of type *clockDrive* of a *Track* node (cf. Figure 2a) represents one clock. Invariants and APs are specified for PTGTSs by means of graph conditions as in Figure 2b and Figure 2e, respectively, for our running example. We use the single invariant *INVdriving* requiring that shuttles in mode *DRIVE* cannot be on a track longer than the value of their *minDur* (minimal duration) attribute plus 1. Moreover, we consider three APs to specify properties that we want to analyze later on. The AP *APunexpectedVelocity* is used to detect graphs in which a shuttle does not have an expected velocity of [2, 3] or [3, 4] time units per track where only the lower end of the interval is stored in the graph in the *minDur* attribute. The AP *APcollision* is used to detect graphs in which two shuttles are on a common track to capture their collision. Finally, the AP *APbraked* is used to detect graphs in which a shuttle has just executed an emergency brake.

PTGT rules of a PTGTS then correspond to edges of a PTA and contain (a) a left-hand side graph *L*, (b) an attribute constraint on the clock attributes contained in *L* to capture a guard, (c) a natural number describing a priority where higher numbers denote higher priorities, and (d) a nonempty set of tuples of the form (*ℓ* : *K* ↪ *L*, *r* : *K* ↪ *R*, *φ*, *C*, *p*) where (*ℓ*, *r*, *φ*) is an underlying GT rule with application condition *φ*,<sup>1</sup> *C* is a set of clock attributes contained in *L* to be reset, and *p* is a real-valued probability from [0, 1] where the probabilities of all such tuples must add up to 1. See Figure 2f, Figure 2g, and Figure 3a for three PTGT rules *SetSlow*, *ConstructionSiteBrake*, and *Drive* from our running example where the last two PTGT rules have a unique underlying GT rule with probability 1 and where the first PTGT rule has a higher priority as well as two underlying GT rules with probabilities 10<sup>−6</sup> and 1 − 10<sup>−6</sup>. For the PTGT rules *ConstructionSiteBrake* and *Drive*, we depict the graphs *L*, *K*, and *R* in a single graph (subsequently called *LKR*-graph) where graph elements to be removed and to be added are annotated with ⊖ and ⊕, respectively. In the PTGT rule *SetSlow*, no graph elements are removed or added (i.e., the graphs *L* and *R* of the underlying GT rules coincide). Nevertheless, for this PTGT rule, we depict the two right-hand side morphisms *r*<sub>1</sub> and *r*<sub>2</sub> as they describe PTGT steps with different attribute modifications and probabilities. Also, the PTGT rules *ConstructionSiteBrake* and *Drive* have application conditions, which are depicted to the left of the symbol or above the symbol. The attribute preconditions and attribute modifications are given for each PTGT rule in the red box below the *LKR*-graph (or are split into multiple red boxes as for the PTGT rule *SetSlow*). In these attribute preconditions and attribute modifications, unprimed (primed) variables denote the values of attributes before (after) GT rule application. Note that if variables are not changed by the GT rule application, we denote this using the operator *unchanged* (see e.g. Figure 2g where *unchanged*(*minD*<sub>1</sub>, *tid*<sub>1</sub>, *tid*<sub>2</sub>) denotes that the variables *minD*<sub>1</sub>, *tid*<sub>1</sub>, and *tid*<sub>2</sub> remain unchanged). Moreover, further information about the PTGT rule (i.e., the guard and the priority) but also further information about the probabilistic choices (i.e., the sets of clocks to be reset and probabilities) are depicted in gray boxes. Lastly, we also allow annotating a PTGT step in the induced state space with (a) a name chosen for the probabilistic choice such as *success* and *failure* in Figure 2f and (b) the values of the variables contained in the list stepLabel (which may contain variables from *L* and *R*).

<sup>1</sup> The underlying GT rule may not delete or add clock attributes.

When comparing PTA and PTGTSs, we observe that PTA edges are either enabled for the current valuation or not whereas PTGT rules may be applicable for many matches at the same time (e.g. allowing the rule *Drive* to be applied for one of multiple shuttles). Priorities used in PTGTSs can be encoded in GTSs (including PTGTSs) by adding the left-hand side graphs of rules with higher priorities as negative application conditions to all rules with a lower priority. Similarly, priorities, if integrated into PTA, could be encoded by refining the guards. However, for our running example, we can exchange the underlying track topology without effort, while this would require a fundamental adaptation of the corresponding PTA. Also, as in [19], we observe in section 6 that small PTGTSs result in PTA of considerable size and we therefore conclude that PTGTSs are typically much more concise compared to PTA.
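The effect of priorities can be illustrated with a few lines of code. This is a minimal sketch, assuming that a rule is blocked whenever some rule with a strictly higher priority has a match; guards and application conditions are ignored, the helper `applicable_rules` and the predicate `has_match` are hypothetical, and the numeric priorities below are made up for illustration.

```python
def applicable_rules(rules, has_match):
    """Return the names of rules that are not blocked by a higher priority.

    rules: iterable of (name, priority) pairs, higher number = higher priority.
    has_match: function mapping a rule name to True if the rule's left-hand
    side has a match in the current graph.
    """
    matched = [(name, prio) for name, prio in rules if has_match(name)]
    if not matched:
        return []
    top = max(prio for _, prio in matched)
    # Only rules on the highest matched priority level remain applicable.
    return sorted(name for name, prio in matched if prio == top)


# Hypothetical priorities: SetSlow is assumed to outrank Drive.
rules = [("SetSlow", 2), ("Drive", 1)]
both = applicable_rules(rules, lambda name: True)
only_drive = applicable_rules(rules, lambda name: name == "Drive")
```

Here `both` contains only `SetSlow`, while `only_drive` contains `Drive` because no higher-priority rule has a match. This is the same filtering that the encoding via negative application conditions achieves inside the rules themselves.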

# **4 Decomposition of Large-Scale Topologies**

We now present our decomposition-based approach to analyze a PTGTS S<sub>0</sub> modeling a large-scale cyber-physical system along the lines of the informal presentation from the introduction. For our running example, such a PTGTS is given by an initial graph typed over the type graph from Figure 2a that is restricted later on in a suitable way, 13 PTGT rules of which we present three in Figure 2f, Figure 2g, and Figure 3a (further rules are given in [20, Appendix]), the invariant from Figure 2b, and the three APs from Figure 2e.

(d) Correspondence of the graph transformation based steps between the large-scale system S<sub>0</sub> and one of its fragment systems S<sub>*i*</sub>, which preserve the respective static structure given by *G̅* and *F̅<sub>i</sub>*.

Fig. 4: FTs for our running example, rule *Merge*, example for topology composition, and correspondence between steps in the large-scale system and a fragment system.

As a first step, we identify a substructure of the initial graph of S<sub>0</sub> that is static in the sense that this substructure is preserved and also never extended throughout all PTGT steps of S<sub>0</sub>. For large-scale cyber-physical systems such as our running example, the existence of such a static substructure may be justified by a logical or spatial distribution. The embedding of a static substructure *G̅* in a given graph *G* is then captured by a monomorphism *κ* : *G̅* ↪ *G* describing how *G̅* is embedded into *G*. As a special case, such an embedding *κ* can be derived for arbitrary graphs *G* by a monomorphism *κ<sub>TG</sub>* : *T̅G̅* ↪ *TG* describing how the given type graph *TG* is restricted to a smaller type graph *T̅G̅*. That is, *G̅* then contains only those elements from *G* that are typed over the smaller type graph *T̅G̅*. For our running example, we restrict the type graph *TG* from Figure 2a to such a smaller type graph *T̅G̅* by removing the *Shuttle* node with its attributes, the *at* edge connected to the *Shuttle* node, and the *active* attributes from the *TLYellow* and *TLGreen* nodes. The graphs *G̅* obtained from graphs *G* of S<sub>0</sub> using this restriction are then called *large-scale topologies (LSTs)* and contain for our running example a track topology with depots, traffic lights, and construction sites. Note that the fact that such an underlying LST is indeed preserved and never extended by arbitrary rule applications can be verified (at least for our running example) by inspecting each rule individually using the technique of 1-induction [9, 26].
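The restriction of a graph to its static part can be sketched as a filter over typed nodes, attributes, and edges. The sketch below uses a deliberately simple, hypothetical graph encoding (nodes as a dictionary from identifiers to a type and an attribute dictionary, edges as triples); the attribute name `mode` is an assumption for illustration, whereas the type names follow the running example.

```python
def restrict_to_lst(nodes, edges, removed_node_types, removed_attrs):
    """Drop dynamic node types and attributes; keep edges between kept nodes.

    nodes: dict id -> (type, attribute dict)
    edges: set of (source id, edge type, target id)
    removed_node_types: set of node type names to drop
    removed_attrs: set of (node type, attribute name) pairs to drop
    """
    kept = {
        nid: (typ, {k: v for k, v in attrs.items()
                    if (typ, k) not in removed_attrs})
        for nid, (typ, attrs) in nodes.items()
        if typ not in removed_node_types
    }
    # Edges attached to removed nodes (e.g. "at" edges of shuttles) vanish too.
    kept_edges = {(s, t, d) for (s, t, d) in edges if s in kept and d in kept}
    return kept, kept_edges


nodes = {
    "t1": ("Track", {}),
    "t2": ("Track", {}),
    "y": ("TLYellow", {"active": True}),
    "s": ("Shuttle", {"mode": "DRIVE", "minDur": 2}),
}
edges = {("t1", "next", "t2"), ("s", "at", "t1")}

lst_nodes, lst_edges = restrict_to_lst(
    nodes,
    edges,
    removed_node_types={"Shuttle"},
    removed_attrs={("TLYellow", "active"), ("TLGreen", "active")},
)
```

In this toy instance, the shuttle, its `at` edge, and the `active` attribute of the yellow traffic light are removed, and the two tracks with their `next` edge remain as the static topology.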

As a second step, we now introduce the notion of a decomposition of the LST into a small set of (constrained) *fragment topologies (FTs)*. Such (constrained) FTs are given by (a) a graph that is typed over the type graph used for the LST and (b) a graph condition describing constraints on how the graph of the FT may be embedded into graphs of S<sub>0</sub>. Moreover, an *overlapping specification o* is required to describe how the *embeddings α<sub>i</sub>* of the graphs of two FTs may overlap in the LST. Such an overlapping specification is given by a set of spans (*o*<sub>1</sub> : *O* ↪ *T*<sub>1</sub>, *o*<sub>2</sub> : *O* ↪ *T*<sub>2</sub>) where *O* is the *permitted overlapping graph* that is embedded into the two FTs. A decomposition of an LST (in the following definition, we simply consider the LST contained in the initial graph *G*<sub>0</sub> of S<sub>0</sub>) is then given by embeddings of selected FTs into the LST (cf. Figure 1) such that the overlapping specification is satisfied (the constraints of the FTs are checked for S<sub>0</sub> later on). In applications, to mitigate the state space explosion problem for the model checking phase later on, it is advantageous to employ a low number of small FTs that are strictly constrained and are allowed to overlap in a manageable number of ways.

#### **Definition 1 (Decomposition of LST).** *If*


*is some pair* (*o*<sub>1</sub> : *O* ↪ *F*<sub>1</sub>, *o*<sub>2</sub> : *O* ↪ *F*<sub>2</sub>) ∈ *o*((*F*<sub>1</sub>, *φ*<sub>1</sub>), (*F*<sub>2</sub>, *φ*<sub>2</sub>)) *such that for the pushout* (*g*<sub>1</sub> : *F*<sub>1</sub> ↪ *P*, *g*<sub>2</sub> : *F*<sub>2</sub> ↪ *P*) *of* (*o*<sub>1</sub>, *o*<sub>2</sub>) *(i.e., the overlapping of F*<sub>1</sub> *and F*<sub>2</sub> *w.r.t.* (*o*<sub>1</sub>, *o*<sub>2</sub>)*) there is some h* : *P* ↪ *G̅*<sub>0</sub> *such that α*<sub>1</sub> = *h* ∘ *g*<sub>1</sub> *and α*<sub>2</sub> = *h* ∘ *g*<sub>2</sub>*.*

*then M is a decomposition of the LST of* S<sub>0</sub> *w.r.t. κ,* F*, and o.*

To provide a better intuition for this definition, we now present the decomposition of the LST considered for our running example.

*Example 1 (Decomposition for Running Example).* Let F contain the constrained FTs (FT<sub>*i*</sub>, *φ<sub>i</sub>*) for 1 ≤ *i* ≤ 8 where each FT<sub>*i*</sub> is given in Figure 4a (here we use an abbreviated notation where *D*, *T*, *Y*, *G*, and *CS* are the obvious abbreviations for the node types of the type graph) and where *φ<sub>i</sub>* states in each case that shuttles must have a velocity of [2, 3] or [3, 4] time units per track.<sup>2</sup>

Let *o*((*F*<sub>1</sub>, *φ*<sub>1</sub>), (*F*<sub>2</sub>, *φ*<sub>2</sub>)) be the overlapping specification stating that overlappings (*o*<sub>1</sub> : *O* ↪ *F*<sub>1</sub>, *o*<sub>2</sub> : *O* ↪ *F*<sub>2</sub>) of two FTs are always (for any of the 8 × 8 combinations) of the form *O* = *T*<sub>1</sub> → *T*<sub>2</sub> → *T*<sub>3</sub> where *T*<sub>1</sub> and *T*<sub>3</sub> are mapped to a *Track* node in *F*<sub>1</sub> and *F*<sub>2</sub> with an entering and an exiting red arrow by *o*<sub>1</sub> and *o*<sub>2</sub>, respectively.

An example of a decomposition of an LST employing the previously mentioned FTs and overlapping specification is given in Figure 4c where three FTs are embedded into an LST. To be appropriate later on, the decomposition must ensure that all tracks of the LST are covered by embedding morphisms to which *Shuttle* nodes may be connected (e.g. due to *Shuttle* nodes in the initial graph of S<sub>0</sub> or due to connected *Depot* nodes from which *Shuttle* nodes may enter the LST). In fact, the eight chosen FTs limit the reasoning for our running example to LSTs that can be decomposed using these FTs. ♦

In general, we consider two use cases: (a) a given PTGTS with underlying LST is to be analyzed and (b) LSTs are to be constructed based on the selected and analyzed FTs. Both use cases are supported but require different handling. For use case (a), a parsing of the LST w.r.t. the given FTs and overlapping specification must be performed to obtain a decomposition of the LST. Efficient parsing algorithms have been devised for the special case of hyperedge replacement (HR) grammars (which require that nodes are not deleted) in [6, 7, 8]. A suitable graph transformation based grammar for our running example with 25 rules is given in [20, Appendix]. For use case (b), in which we need to construct some LST, we may employ node deleting rules. For our running example, consider the rule *Merge* from Figure 4b that can be used to iteratively overlap two FTs starting with a disjoint union of copies of FTs. The rule *Merge* overlaps two instances of three successive *Track* nodes following the overlapping specification where the application condition ensures that the rule is applied at entry and exit points, also excluding the possibility that the six matched *Track* nodes belong to an instance of FT<sub>*i*</sub> using ¬*φ*<sub>FT<sub>*i*</sub></sub>.
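Overlapping two FTs along a common three-track graph amounts to gluing their node and edge sets. The following is a minimal sketch of such a gluing for simple graphs without attributes, types, or parallel edges; it is not the rule *Merge* itself, and the function `glue` as well as the two chain-shaped example fragments are hypothetical.

```python
def glue(f1, f2, o1, o2):
    """Glue two graphs along an overlap given by injective maps o1 and o2.

    f1, f2: pairs (set of nodes, set of (source, target) edges)
    o1, o2: dicts mapping each overlap node to its image in f1 and f2
    Returns the glued nodes and edges plus the two embeddings g1 and g2.
    """
    nodes1, edges1 = f1
    nodes2, edges2 = f2
    # Nodes of f1 are kept as they are, tagged with 1 for disjointness.
    g1 = {n: (1, n) for n in nodes1}
    # Overlapped nodes of f2 are identified with their counterparts in f1.
    shared = {o2[x]: o1[x] for x in o1}
    g2 = {n: (1, shared[n]) if n in shared else (2, n) for n in nodes2}
    nodes = set(g1.values()) | set(g2.values())
    edges = {(g1[s], g1[d]) for s, d in edges1}
    edges |= {(g2[s], g2[d]) for s, d in edges2}
    return nodes, edges, g1, g2


# Two hypothetical track chains; the last three tracks of the first chain
# are overlapped with the first three tracks of the second chain.
f1 = ({"a", "b", "c", "d", "e"},
      {("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")})
f2 = ({"p", "q", "r", "s"}, {("p", "q"), ("q", "r"), ("r", "s")})
o1 = {"T1": "c", "T2": "d", "T3": "e"}
o2 = {"T1": "p", "T2": "q", "T3": "r"}

nodes, edges, g1, g2 = glue(f1, f2, o1, o2)
```

The glued graph has six tracks instead of nine, and both embeddings agree on the overlap, i.e., `g1[o1[x]] == g2[o2[x]]` for every overlap node `x`.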

<sup>2</sup> For each FT from Figure 4a, this constraint can be formalized as a graph condition.

# **5 Overapproximation of Behavior**

The decompositions of LSTs introduced in the previous section are now used as a foundation to establish a behavioral relationship between a given PTGTS S<sub>0</sub> and *n* PTGTSs S<sub>*i*</sub> that operate on the instances of FTs that are embedded into the LST of S<sub>0</sub> according to the given LST decomposition.

For this purpose, we extend the structural embeddings given by the *α* monomorphisms from FTs to the LST in Definition 1 to embeddings of the entire graph (including the static but also the dynamic parts) of a state of some S<sub>*i*</sub> called *fragment topology state (FTS)* into the entire graph of a state of S<sub>0</sub> called *large-scale state (LSS)*. Consider the left middle square in Figure 4d where the embedding *α<sub>i</sub>* together with the FT and LST embeddings *κ<sub>i</sub>* and *κ* is complemented with an embedding *e<sub>i</sub>* of the FTS *F<sub>i</sub>* into the LSS *G*. Note that *e<sub>i</sub>* must be an extension of *α<sub>i</sub>* in the sense that the square commutes (i.e., *κ* ∘ *α<sub>i</sub>* = *e<sub>i</sub>* ∘ *κ<sub>i</sub>* is required). Also, *e<sub>i</sub>* ∘ *κ<sub>i</sub>* must satisfy the constraint *φ<sub>i</sub>* of the FT used for S<sub>*i*</sub>.

To simplify our presentation, we assume that the PTGTS S<sub>0</sub> (as in our running example) only employs APs of the form ∃(*f* : ∅ ↪ *P*, ⊤), invariants of the form ¬∃(*f* : ∅ ↪ *P*, ⊤), and application conditions in PTGT rules that are conjunctions of graph conditions of the form ¬∃(*f* : ∅ ↪ *P*, ⊤) for some graph *P*. This restriction simplifies the identification of parts of FTSs and LSSs that are considered for an evaluation of such graph conditions.

As a next step, we present a decomposition relation, which establishes a relationship between S<sub>0</sub> and the PTGTSs S<sub>*i*</sub> in terms of embedding monomorphisms *κ*, *α<sub>i</sub>*, *e<sub>i</sub>*, and *κ<sub>i</sub>* for all reachable states of S<sub>0</sub>. Moreover, the decomposition relation requires that (a) the timed and discrete steps of S<sub>0</sub> can be mimicked by each affected S<sub>*i*</sub> and (b) that discrete steps performed by some PTGTS S<sub>*i*</sub> in isolation on a part of the LST where the FT *F̅<sub>i</sub>* does not overlap with the FT *F̅<sub>j</sub>* of another PTGTS S<sub>*j*</sub> with *i* ≠ *j* can be mimicked by S<sub>0</sub>. That is, the decomposition relation is a simulation for the steps performed by S<sub>0</sub> and a bisimulation on those steps that are performed in isolation by a single PTGTS S<sub>*i*</sub>. Also, to allow deriving results for S<sub>0</sub> from a model checking based analysis of the PTGTSs S<sub>*i*</sub>, we require a set of APs A that is part of the APs of S<sub>0</sub> and of each S<sub>*i*</sub>. Based on this set A, the decomposition relation also requires that only those FTSs and LSSs are related that satisfy the same sets of APs in A. For our running example, this set will contain all three APs of S<sub>0</sub> (see Figure 2e). Finally, we require that the initial states of S<sub>0</sub> and the *n* PTGTSs S<sub>*i*</sub> are covered by the decomposition relation.

#### **Definition 2 (Decomposition Relation).** *Given*


*S is a* decomposition relation *between* S<sub>0</sub> *and* (S<sub>1</sub>, ..., S<sub>*n*</sub>) *containing tuples of the form* ((*G*, *v*), *κ* : *G̅* ↪ *G*, *w*) *where* (*G*, *v*) *is a state of* S<sub>0</sub>*, κ identifies the LST of G, and w is a tuple of length n of tuples of the form* (*s<sub>i</sub>*, *F̅<sub>i</sub>*, *φ<sub>i</sub>*, *α<sub>i</sub>*, *κ<sub>i</sub>*, *e<sub>i</sub>*) *when the following items are satisfied.*

	- **–** ((*G*, *v*), *κ* : *G̅* ↪ *G*, *w*) ∈ *S and*
	- **–** S<sub>0</sub> *performs the structural step from* (*G*, *v*) *to* (*G*″, *v*″) *using an underlying GT rule ρ* = (*ℓ* : *K* ↪ *L*, *r* : *K* ↪ *R*, *φ<sub>ac</sub>*) *given in Figure 4d where, since the step of* S<sub>0</sub> *preserves the LST, there are unique κ*′ : *G̅* ↪ *G*′ *and κ*″ : *G̅* ↪ *G*″ *such that ℓ̂* ∘ *κ*′ = *κ and κ*″ = *r̂* ∘ *κ*′*, then*
	- **–** ((*G*″, *v*″), *κ*″ : *G̅* ↪ *G*″, *w*″) ∈ *S for some w*″ *that is obtained pointwise from w by adapting each tuple* ((*F<sub>i</sub>*, *v<sub>i</sub>*), *F̅<sub>i</sub>*, *φ<sub>i</sub>*, *α<sub>i</sub>*, *κ<sub>i</sub>*, *e<sub>i</sub>*) ∈ *w into a resulting tuple* ((*F*″<sub>*i*</sub>, *v*″<sub>*i*</sub>), *F̅<sub>i</sub>*, *φ<sub>i</sub>*, *α<sub>i</sub>*, *κ*″<sub>*i*</sub>, *e*″<sub>*i*</sub>) *as follows. If m*(*L*) ∩ *e<sub>i</sub>*(*F<sub>i</sub>*) = ∅*, then all components of the tuple remain unchanged. Otherwise, the PTGTS* S<sub>*i*</sub> *must simulate the step and the tuple needs the updating described in the following steps.*
		- *There must be a step of* S<sub>*i*</sub> *as given in Figure 4d from F<sub>i</sub> to F*″<sub>*i*</sub> *for some underlying rule ρ<sub>i</sub>* = (*ℓ<sub>i</sub>* : *K<sub>i</sub>* ↪ *L<sub>i</sub>*, *r<sub>i</sub>* : *K<sub>i</sub>* ↪ *R<sub>i</sub>*, *φ<sub>ac,i</sub>*) *with the same probability and priority as ρ.*
	- **–** ((*G*, *v*), *κ* : *G̅* ↪ *G*, *w*) ∈ *S,*
	- **–** ((*F<sub>i</sub>*, *v<sub>i</sub>*), *F̅<sub>i</sub>*, *φ<sub>i</sub>*, *α<sub>i</sub>*, *κ<sub>i</sub>*, *e<sub>i</sub>*) ∈ *w,*
	- **–** S<sub>*i*</sub> *performs the structural step from* (*F<sub>i</sub>*, *v<sub>i</sub>*) *to* (*F*″<sub>*i*</sub>, *v*″<sub>*i*</sub>) *using an underlying GT rule ρ<sub>i</sub>* = (*ℓ<sub>i</sub>* : *K<sub>i</sub>* ↪ *L<sub>i</sub>*, *r<sub>i</sub>* : *K<sub>i</sub>* ↪ *R<sub>i</sub>*, *φ<sub>ac,i</sub>*) *given in Figure 4d where, since the step of* S<sub>*i*</sub> *preserves the FT, there are unique κ*′<sub>*i*</sub> : *F̅<sub>i</sub>* ↪ *F*′<sub>*i*</sub> *and κ*″<sub>*i*</sub> : *F̅<sub>i</sub>* ↪ *F*″<sub>*i*</sub> *such that ℓ̂<sub>i</sub>* ∘ *κ*′<sub>*i*</sub> = *κ<sub>i</sub> and κ*″<sub>*i*</sub> = *r̂<sub>i</sub>* ∘ *κ*′<sub>*i*</sub>*,*
	- **–** *e<sub>i</sub>*(*m<sub>i</sub>*(*L<sub>i</sub>*)) *does not overlap with any e<sub>j</sub>*(*F<sub>j</sub>*) *for i* ≠ *j, then*
	- **–** *there is some* ((*G*″, *v*″), *κ*″ : *G̅* ↪ *G*″, *w*″) ∈ *S for some G*″*, v*″*, κ*″*, and w*″ *as follows.*
		- *There must be a step of* S<sub>0</sub> *as given in Figure 4d from G to G*″ *for some underlying rule ρ* = (*ℓ* : *K* ↪ *L*, *r* : *K* ↪ *R*, *φ<sub>ac</sub>*) *with the same probability and priority as ρ<sub>i</sub>.*
		- *Since the step of* S<sub>0</sub> *preserves the LST, there are unique κ*′ : *G̅* ↪ *G*′ *and the required κ*″ : *G̅* ↪ *G*″ *such that ℓ̂* ∘ *κ*′ = *κ and κ*″ = *r̂* ∘ *κ*′*.*
		- *The step of* S<sub>0</sub> *must allow for e*′<sub>*i*</sub> : *F*′<sub>*i*</sub> ↪ *G*′ *and e*″<sub>*i*</sub> : *F*″<sub>*i*</sub> ↪ *G*″ *such that ℓ̂* ∘ *e*′<sub>*i*</sub> = *e<sub>i</sub>* ∘ *ℓ̂<sub>i</sub> and r̂* ∘ *e*′<sub>*i*</sub> = *e*″<sub>*i*</sub> ∘ *r̂<sub>i</sub>.*
		- *Finally, w*″ *is obtained from w by only adapting the above chosen tuple* ((*F<sub>i</sub>*, *v<sub>i</sub>*), *F̅<sub>i</sub>*, *φ<sub>i</sub>*, *α<sub>i</sub>*, *κ<sub>i</sub>*, *e<sub>i</sub>*) *into the tuple* ((*F*″<sub>*i*</sub>, *v*″<sub>*i*</sub>), *F̅<sub>i</sub>*, *φ<sub>i</sub>*, *α<sub>i</sub>*, *κ*″<sub>*i*</sub>, *e*″<sub>*i*</sub>)*.*

We now state that decomposition relations allow for the simulation of each path of the PTGTS S<sub>0</sub> by the PTGTSs S<sub>*i*</sub>.

**Lemma 1 (Existence of Simulating Paths).** *If S is a decomposition relation between* S<sub>0</sub> *and* (S<sub>1</sub>, ..., S<sub>*n*</sub>)*, and π is a path of length m in* S<sub>0</sub> *from the initial state to a state s<sub>m</sub>, then, for each* 1 ≤ *i* ≤ *n, there is a path π<sub>i</sub> of* S<sub>*i*</sub> *(of length k<sub>i</sub>* ≤ *m) ending in a state s*<sub>*i*,*k<sub>i</sub>*</sub> *such that* (*s<sub>m</sub>*, *κ*, *w*) ∈ *S for some κ and w where the ith element of w is of the form* (*s*<sub>*i*,*k<sub>i</sub>*</sub>, *F̅<sub>i</sub>*, *φ<sub>i</sub>*, *α<sub>i</sub>*, *κ<sub>i</sub>*, *e<sub>i</sub>*)*. Moreover, the probability of each such path π<sub>i</sub> is at least as high as the probability of the path π. See [20] for the proof.*

We now state that a PTGTS satisfies a safety property given by an AP, when safety w.r.t. this AP can be established for each S<sub>*i*</sub>.

**Theorem 1 (Safety Verification).** *If S is a decomposition relation between* S<sub>0</sub> *and* (S<sub>1</sub>, ..., S<sub>*n*</sub>) *w.r.t.* A *and ap* ∈ A*, then* S<sub>0</sub> *is safe w.r.t. the occurrence of an ap-labeled graph when (for each* 1 ≤ *i* ≤ *n)* S<sub>*i*</sub> *is safe w.r.t. the occurrence of an ap-labeled graph. Moreover, the probability of an occurrence of an ap-labeled graph from some state s in* S<sub>0</sub> *is smaller than the probability of an occurrence of an ap-labeled graph from some S-related state s<sub>i</sub> in* S<sub>*i*</sub>*. See [20] for the proof.*

We now apply the proposed methodology of establishing a behavioral relationship between the PTGTS S<sub>0</sub> and the PTGTSs S<sub>*i*</sub> to our running example. For this purpose, we now describe how the FTS of each S<sub>*i*</sub> is embedded into the LSS of S<sub>0</sub> and, based on this embedding, how each S<sub>*i*</sub> is derived from S<sub>0</sub>.

*Example 2 (Construction of Embeddings and Simulating PTGTSs).* Firstly, the embeddings *e<sub>i</sub>* of FTSs into the LSS are obtained as extensions of the structural embeddings *κ<sub>i</sub>* by also matching (a) all *Shuttle* nodes (with their attributes) that are connected to *Track* nodes contained in the FT via *next* edges and (b) all *active* attributes of *TLYellow* and *TLGreen* nodes contained in the FT. This extension also naturally applies to the initial state of S<sub>0</sub>. Clearly, two embeddings *e<sub>i</sub>* and *e<sub>j</sub>* (for *i* ≠ *j*) only overlap in elements of their FTs but not in the additionally matched dynamic elements.

Secondly, we adapt the given PTGTS S<sub>0</sub> to obtain for each of the eight FTs one PTGTS S<sub>*i*</sub> by (a) changing the initial graph to the source of *e<sub>i</sub>* capturing the FT as well as the additional dynamic elements of the initial state of S<sub>0</sub> connected to it, and (b) adding eight rules for overapproximating the behavior of S<sub>0</sub> on the tracks that may overlap with tracks of other FTs. For the latter point, we observe that all but three of the rules of S<sub>0</sub> (including *SetSlow* and *ConstructionSiteBrake* from Figure 2) are never applicable on the parts of FTs that may overlap with other FTs (i.e., borders of FTs). The remaining three rules are *Drive* from Figure 3a as well as two similar rules for stopping the shuttle that we do not consider in detail here. Three of the four derived rules for rule *Drive* are given in Figure 3.

The additional rule *DriveEnterFast* is used to simulate *Drive* steps where a shuttle in S<sub>0</sub> drives from a track not covered by S<sub>*i*</sub> to a track covered by S<sub>*i*</sub>. The rule *DriveEnterFast* is essentially constructed by omitting the source track *T*<sub>1</sub> from the rule *Drive*, by adding the shuttle with one of the two expected velocities (the other velocity results in the omitted rule *DriveEnterSlow*),<sup>3</sup> and by omitting application conditions that may not be satisfied due to the overlapping specification and the structure of FTs.

Similarly, the additional rules *DriveExit1* and *DriveExit2* are constructed from rule *Drive* to allow for the simulation of the two steps in which a shuttle in S<sub>0</sub> drives using rule *Drive* on two tracks covered by S<sub>*i*</sub> to a track not covered by S<sub>*i*</sub>. These two rules are then constructed similarly, by omitting the tracks *T*<sub>3</sub> (for *DriveExit1*) and *T*<sub>3</sub> and *T*<sub>4</sub> (for *DriveExit2*) from rule *Drive* as these are not covered by S<sub>*i*</sub>, by removing the shuttle with its attributes in rule *DriveExit2*, by omitting application conditions that may not be satisfied due to the overlapping specification and the structure of FTs, and by omitting application conditions that refer to the removed tracks.

Note that these additional rules overapproximate the behavior that is possible in S<sub>0</sub> as they may be used when analyzing S<sub>*i*</sub> also when no corresponding shuttle in S<sub>0</sub> is able to enter the FT or when rule *Drive* would be disabled due to the omitted application conditions for the case of rules *DriveExit1* and *DriveExit2*. ♦

For our running example, we now describe the construction of a suitable decomposition relation relying on the LST decomposition introduced before.

<sup>3</sup> Here, we rely on the constraints on the eight FTs (cf. Example 1) requiring that the AP *APunexpectedVelocity* is never labeled in the large-scale system S<sub>0</sub>.


**Lemma 2 (Existence of Decomposition Relation for Running Example).** *For the PTGTS* S<sub>0</sub> *of our running example with an arbitrary initial LST such that M is a decomposition of that LST w.r.t. some monomorphism κ, the set of eight FTs, and the overlapping specification o from Example 1, there is a decomposition relation S between* S<sub>0</sub> *and the n PTGTSs* S<sub>*i*</sub> *from Example 2. See [20] for the proof.*

Based on this decomposition relation and Theorem 1, we can obtain the desired overapproximation result for S<sub>0</sub> for the qualitative safety w.r.t. collisions and the quantitative unlikeliness of emergency brakes.

**Corollary 1 (Qualitative and Quantitative Safety for Running Example).** S<sub>0</sub> *exhibits no collisions when this is the case for each* S<sub>*i*</sub>*. Moreover, emergency brakes are performed in* S<sub>0</sub> *with a probability not higher than the probability of such an occurrence in any* S<sub>*i*</sub>*.*

Note that we only need to analyze one PTGTS for each of the eight permitted FTs w.r.t. the occurrence of collisions and the probability of emergency brakes.
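How per-fragment results carry over to the large-scale system can be sketched as follows. This is a minimal illustration, reading "in any S<sub>*i*</sub>" as "in every S<sub>*i*</sub>", so that the smallest fragment probability serves as the upper bound; the function `combine` is hypothetical and the numbers in the example are placeholders, not results from the evaluation.

```python
def combine(fragment_results):
    """Lift fragment verdicts to the large-scale system.

    fragment_results: dict mapping a fragment name to a pair
    (collision_free, brake_probability) obtained for that fragment.
    Returns (collision_freedom_established, brake_probability_bound).
    """
    if not fragment_results:
        raise ValueError("at least one fragment result is required")
    # Collision freedom follows only if every fragment system is safe.
    established = all(ok for ok, _ in fragment_results.values())
    # The large-scale probability is bounded by each fragment probability.
    bound = min(prob for _, prob in fragment_results.values())
    return established, bound


# Placeholder values for illustration only.
example = {"FT1": (True, 1e-5), "FT2": (True, 2e-6), "FT3": (True, 4e-6)}
established, bound = combine(example)
```

Because the fragment systems overapproximate the behavior, a fragment that is not collision-free does not prove a collision in the large-scale system; in that case `established` is merely `False`, meaning that no conclusion can be drawn from this analysis.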

# **6 Evaluation**

To analyze the eight PTGTSs constructed for our running example in section 5 (see Table 1 for the results), we have employed the methodology from [19] generating the state spaces for these PTGTSs without timed steps and then generated the corresponding PTA from these state spaces. We then restricted these PTA to timed automata (TA) essentially removing the information on probabilities, applied UPPAAL [15] to determine the edges of the TA that can never be applied due to unsatisfiable guards, and removed the corresponding edges from the previously generated PTA. The entire analysis using our prototypical implementation required less than three days on a machine using up to 250 GB memory where the state space generation required most of the time. However, there is a vast potential for optimizations regarding memory consumption (by only storing subsequently relevant information on states and steps) and runtime (by facilitating concurrency during state space generation).

Firstly, using UPPAAL, we have verified that each of the eight TA (hence, also each of the eight PTA) has no reachable deadlock (where also timed steps are disabled). Hence, we obtain that the PTGTS S<sup>0</sup> also does not contain this particular modeling error since, using the decomposition relation, we also obtain that every deadlock reachable in S<sup>0</sup> can be reached analogously in each S*i*.

Secondly, we have observed that the obtained PTA do not label any location with *APunexpectedVelocity* or *APcollision*. For *APunexpectedVelocity* this means that the additional rules such as *DriveEnterFast* and *DriveEnterSlow* for overapproximating the steps of entering shuttles entirely cover all possible velocities of shuttles. For *APcollision* this means that Corollary 1 implies that the PTGTS S<sup>0</sup> with an LST constructed in the described way from the eight FTs is safe w.r.t. the occurrence of collisions.

Thirdly, to verify that yellow traffic lights suitably slow down the shuttles before construction sites, we have identified locations ℓ*<sub>i</sub>* in the resulting PTA that are labeled with *APbraked* (occurring only in FT4 and FT5). In each case, we were able to track, using a custom analysis algorithm (since the PRISM model checker was too slow for the large PTA at hand), the shuttle backwards over all possible paths leading to such a location ℓ*<sub>i</sub>* up to the step where the shuttle entered the FT. We then determined the maximal probability of any such path, obtaining a worst-case emergency brake probability of 10<sup>−6</sup> and 10<sup>−12</sup> for any entering shuttle in FT4 and FT5, respectively. On the one hand, FT5 is thereby verified to be quantitatively more desirable compared to FT4. On the other hand, Corollary 1 implies that installations of yellow traffic lights as in FT4 and FT5 suitably decrease the likelihood of emergency brakes also for S<sup>0</sup>. However, the probabilities that some shuttle executes an emergency brake in a given time span in FT4/FT5 (obtained by combining the maximal throughput of shuttles for FT4/FT5 with the worst-case probability obtained for FT4/FT5) can be expected to be too coarse upper bounds when the maximal throughput is not to be expected for the real system.

### **7 Conclusion and Future Work**

We presented an analysis approach for large-scale systems modeled as PTGTSs for which model checking is not feasible. In this approach, we rely on a decomposition of an underlying static large-scale topology into fragment topologies of manageable size. Model checking is then applied for each fragment topology and an adaptation of the PTGTS to such a fragment topology. We thereby determine (a) overapproximations of reachability properties important for qualitative safety properties and (b) upper bounds for probabilistic reachability properties important for quantitative safety properties.

As future work, we intend to extend our analysis to fairness properties and conditions of the metric temporal graph logic (MTGL) [29]. Also, to cover further aspects of the RailCab project [23], we will develop more general decomposition schemes where dynamic components (such as connected shuttles driving in convoys) may be covered by multiple fragment topologies. Lastly, to further evaluate the applicability of our approach, we intend to apply it to other case studies such as the one discussed in [1].

# **References**


*2004, Proceedings*. Ed. by Yassine Lakhnech and Sergio Yovine. Vol. 3253. Lecture Notes in Computer Science. Springer, 2004, pp. 293–308. isbn: 3-540-23167-6. doi: 10.1007/978-3-540-30206-3\_21.


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Efficient Bounded Model Checking of Heap-Manipulating Programs using Tight Field Bounds**

Pablo Ponzio<sup>1,3</sup>, Ariel Godio<sup>2</sup>, Nicolás Rosner, Marcelo Arroyo<sup>1</sup>, Nazareno Aguirre<sup>1,3</sup> and Marcelo F. Frias<sup>2,3</sup>

<sup>1</sup> University of Río Cuarto, Río Cuarto, Argentina {pponzio,marcelo.arroyo,naguirre}@dc.exa.unrc.edu.ar

<sup>2</sup> Buenos Aires Institute of Technology (ITBA), Buenos Aires, Argentina {agodio,mfrias}@itba.edu.ar

<sup>3</sup> National Council for Scientific and Technical Research (CONICET), Buenos Aires, Argentina

**Abstract.** Software model checkers are able to exhaustively explore different bounded program executions arising from various sources of nondeterminism. These tools provide statements to produce non-deterministic values for certain variables, thus forcing the corresponding model checker to consider *all* possible values for these during verification. While these statements offer an effective way of verifying programs handling basic data types and simple structured types, they are inappropriate as a mechanism for nondeterministic generation of pointers, favoring the use of insertion routines to produce dynamic data structures when verifying, via model checking, programs handling such data types.

We present a technique to improve model checking of programs handling heap-allocated data types, by taming the explosion of candidate structures that can be built when non-deterministically initializing heap object fields. The technique exploits precomputed *relational bounds* that disregard values deemed invalid by the structure's type invariant, thus reducing the state space to be explored by the model checker. Precomputing the relational bounds is itself a challenging and costly task, for which we also present an efficient algorithm, based on incremental SAT solving. We implement our approach on top of the CBMC bounded model checker, and show that, for a number of data structure implementations, we can handle significantly larger input structures and detect faults that CBMC is unable to detect.

#### **1 Introduction**

SAT-based bounded model checking [7] is an automated software analysis technique, consisting of appropriately encoding a program as a propositional formula in such a way that its satisfying valuations correspond to program defects, such as violations of assertions, uncaught exceptions and memory leaks. Satisfying valuations of the obtained propositional formulas can be automatically searched for by resorting to SAT solving, exploiting the constant advances in this analysis technology. SAT-based bounded model checking achieves full automation in program verification at the cost of completeness: it limits the number of times that loops are allowed to be executed to a user provided loop unwinding bound. This in turn limits the data that the program can manipulate, which is constrained to the program parameters, and what the program can allocate in its bounded executions. Hence, although the approach is capable of exploring a huge number of execution traces, it cannot prove program correctness due to its bounded nature. Nevertheless, it is very useful for bug finding, and is able to support fully-fledged higher-level programming languages [8].

<sup>-</sup> Nicolás Rosner was affiliated with the University of Buenos Aires, Buenos Aires, Argentina at the time of contribution to this work.

© The Author(s) 2021

E. Guerra and M. Stoelinga (Eds.): FASE 2021, LNCS 12649, pp. 218–239, 2021. https://doi.org/10.1007/978-3-030-71500-7\_11

A tool based on bounded model checking over SAT is CBMC [20]. It supports all of ANSI-C, including programs handling pointers and pointer arithmetic. The tool is able to exhaustively explore many user-bounded program executions resulting from various sources of non-determinism, including scheduling decisions and the assignment of values to program variables. To achieve this, CBMC provides statements to produce non-deterministic values for certain variables, forcing the model checker to consider *all* possible values for these variables during verification. These statements enable program verification on *all* legal inputs, by assigning these inputs values within their corresponding (legal) domains. While this mechanism is effective for the verification of programs manipulating basic data types and simple structured types, it is disabled as a feature for the generation of pointers. This issue forces the user to provide an ad-hoc environment to verify programs handling dynamic data structures. In fact, a typical, convenient mechanism to verify programs handling heap-allocated linked structures using CBMC and similar tools, is to non-deterministically build such structures using insertion routines [19, 22, 11].

The aforementioned approach, while effective, has its scalability tied to how complex the insertion routines are, and how many of these are actually needed. Indeed, there are many linked structures whose domain of valid structures cannot be built only via insertion operations (e.g., red-black trees and node caching linked lists require insertions as well as removals, in order to reach all bounded valid structures). In this paper, we study an alternative technique for verifying, using CBMC, programs handling heap-allocated linked structures. The approach essentially consists of building a pool of objects with nondeterministically initialized fields, which are then used for nondeterministically building structures. The rapid explosion in the number of generated linked structures is tamed by exploiting precomputed bounds for fields, that disregard values deemed invalid by the structure's assumed properties, such as datatype invariants and routine preconditions. This leaves us the additional problem of precomputing these bounds, a computationally costly task on its own. We then present a novel algorithm for these precomputations, based on incremental SAT solving, making the whole process fully automated.

```
avl_init(t);
int size = nondet_int();
__CPROVER_assume(size>=0 && size<=MAX_SIZE);
for (int i = 0; i < size; i++) {
    int value = nondet_int();
    __CPROVER_assume(value >= MIN_VAL && value < MAX_VAL);
    avl_insert(t, value);
}
int r_value = nondet_int();
__CPROVER_assume(r_value >= MIN_VAL && r_value < MAX_VAL);
avl_remove(t, r_value);
__CPROVER_assert(avl_repok(t));
```
Fig. 1: Verification of AVL remove, building structures by multiple insertions.

We perform an experimental evaluation on a benchmark of data structure implementations, showing that the use of field bounds contributes significantly to improve both memory consumption and verification running times (including the precomputations), allowing CBMC to consider larger structures as well as to detect faults that could not be detected without their use.

### **2 A Motivating Example**

Let us start by describing a particular verification scenario, that will serve the purpose of motivating our approach. Suppose that we have an implementation of dictionaries, based on AVL trees; furthermore, we would like to verify that the remove operation on this structure preserves the structure's invariant, i.e., after a removal is performed, the resulting structure is still a valid AVL tree (acyclic, with every node having at most one parent, sorted, and balanced). Moreover, let us assume that, besides operation avl\_remove, we have AVL's avl\_init, avl\_insert and avl\_repok, the latter being a routine that checks whether a given structure satisfies the AVL invariant, as described above. In order to perform the desired verification, we can proceed by building the program shown in Figure 1. Notice how this program:


When running CBMC on this program, if loops are unwound enough and no violation of the assertion is obtained, then we have verified that, within the provided bounds, remove indeed preserves the invariant.

The above traditional approach to verifying linked structures using CBMC and similar tools [19, 22, 11] has its efficiency tied to how complex the involved routines are, in particular the insertion routine(s) (the avl\_remove routine, being verified, cannot be avoided).

```
t = nondet_avl(MAX_SIZE, MIN_VAL, MAX_VAL);
__CPROVER_assume(avl_repok(t));
int r_value = nondet_int();
__CPROVER_assume(r_value >= MIN_VAL && r_value < MAX_VAL);
avl_remove(t, r_value);
__CPROVER_assert(avl_repok(t));

avlnode* nondet_avl(int size,
                    int min_val,
                    int max_val) {
  avlnode *n = malloc(sizeof(avlnode) * size);
  avlnode *result = NULL;
  if (nondet_bool())
    // root is null
    return result;
  result = n[0]; // root is n0
  n[0]->left = NULL;
  if (nondet_bool())
    n[0]->left = n[1];
  n[0]->right = NULL;
  if (nondet_bool())
    n[0]->right = n[1];
  else if (nondet_bool())
    n[0]->right = n[2];
  ...
  return result;
}
```
Fig. 2: Verification of AVL remove, nondeterministically building linked structs

An alternative approach, employed by some symbolic execution-based model checkers, notably [3, 23], consists of creating a pool of nodes, whose fields are nondeterministically set, and which are also nondeterministically used to build data structures. The process is illustrated in Figure 2. The key is the use of a routine nondet\_avl(), which encapsulates the generation of the linked structure; a fragment of this routine is shown in Figure 2. Notice how this routine will generate invalid structures, e.g., cyclic ones. The avl\_repok(t) assumption after the generation takes care of disregarding these invalid structures for verification. Notice also how our manually written example generation routine avoids using any node besides n[0] as the root, or any node but n[1] as n[0]->left, thus avoiding some isomorphic structures and obvious cycles, but it does not prevent nodes from having more than one parent, nor does it seem to take into account the tree's balancedness. Of course, we have other alternatives when writing the nondeterministic generation routine nondet\_avl. We may produce a generation routine that, based solely on the fields of the nodes involved in the structure and their types, produces all possible structures, leaving the work of filtering out invalid ones to the assume(avl\_repok(t)) part of the program. We can also write a sophisticated generation routine specifically tailored for AVL trees, that already takes into account (most) invalid values for each node field, and thus mostly produces valid structures. The first option has the advantage of being *generic*, and thus can be made part of an automated verification technique, at the cost of being, intuitively, less efficient; the second (and our example), on the other hand, has in principle to be manually produced, and is more error prone, since we may be disregarding some valid values, making the verification bounded incomplete, but is intuitively more efficient.

The technique we present in this paper consists of automatically producing the second kind of generation routines. We will start with the first kind of generation, and automatically decide which values for each field of each node can be safely removed, when we can establish that they do not participate in valid structures (i.e., structures satisfying the corresponding structure invariant). This additional problem of deciding when a value for a node field's domain can be safely removed is solved using a novel algorithm, presented in this paper, which uses incremental SAT solving.

# **3 Tight Field Bounds**

Tight field bounds are based on a relational semantics of structures' fields in program states. The relational semantics of structures is based on interpreting a field f at a given program state as the set of pairs ⟨id, v⟩ relating the identifier id (representing a unique reference to some data object o in the heap) with the value v in the field f of o at that state (i.e., o->f = v in the state). Then, each program state corresponds to a set of (functional) binary relations, one per field of the structures involved in the program. For example, the program state containing the binary tree depicted at the left of Fig. 3 is represented by the following relations:

$$\texttt{left} = \{ \langle N_0, N_1 \rangle, \langle N_1, N_3 \rangle \}, \quad \texttt{right} = \{ \langle N_0, N_2 \rangle, \langle N_1, N_4 \rangle, \langle N_2, N_5 \rangle \} \tag{1}$$

For analysis techniques that must consider all possible state configurations that satisfy some given property, we may reduce this relational semantics by considering *tight field bounds*. Intuitively, for a field f and a property α, its tight field bound on α is the union of f's representation across all program states that satisfy α. Tight field bounds have been used to reduce the number of variables and clauses in propositional representations of relational heap encodings for Java automated analyses [14, 13, 2], and in symbolic execution based model checking to prune parts of the symbolic execution search tree constraining nondeterministic options [15, 26] (see section 6 for a more detailed description of these previous applications). Tight field bounds are computed from *assumed* properties, and can be employed to restrict structures in states that are assumed to satisfy such properties, i.e. *precondition* states. In our case, we will use the invariant of the structure, as opposed to stronger preconditions, so that these can be reused across several routines of the same structure.

**Definition 1.** *Let* f *be a field of structure* T<sub>1</sub> *with type* T<sub>2</sub>*. Let* i *and* j *be the scopes for types* T<sub>1</sub> *and* T<sub>2</sub>*, respectively. Let* A = {a<sub>1</sub>,...,a<sub>i</sub>} *be the identifiers for data objects of type* T<sub>1</sub>*, and let* B = {b<sub>1</sub>,...,b<sub>j</sub>} *be the identifiers for data objects of type* T<sub>2</sub>*. Given an identifier* k*,* o<sub>k</sub> *denotes the corresponding data object. The tight field bound for field* f *is the smallest relation* U<sub>f</sub> ⊆ A×(B+*Null*) *satisfying:* ⟨x, y⟩ ∈ U<sub>f</sub> *iff there exists a valid heap instance* I *in which* o<sub>x</sub>->f = o<sub>y</sub>*.*
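To make Definition 1 concrete, the following minimal sketch computes tight field bounds for the left and right fields of binary trees by brute force, for a scope of three nodes. It is an illustration only: the names `SCOPE`, `NIL`, `is_tree` and `compute_bounds` are hypothetical, the invariant is a simplified one (the nodes reachable from node 0 form a tree, and unused nodes carry no links), no symmetry breaking is applied, and exhaustive enumeration stands in for any SAT-based computation.

```c
#include <stdbool.h>
#include <string.h>

#define SCOPE 3          /* hypothetical scope: three node identifiers */
#define NIL   SCOPE      /* index used to represent NULL */

/* Simplified invariant (an assumption of this sketch): the nodes reachable
   from node 0 form a binary tree, and unreachable nodes have NULL fields. */
bool is_tree(const int left[SCOPE], const int right[SCOPE]) {
    int visits[SCOPE] = {0};
    int stack[2 * SCOPE + 1];
    int top = 0;
    stack[top++] = 0;
    while (top > 0) {
        int n = stack[--top];
        if (++visits[n] > 1) return false;   /* shared node or cycle */
        if (left[n]  != NIL) stack[top++] = left[n];
        if (right[n] != NIL) stack[top++] = right[n];
    }
    for (int i = 0; i < SCOPE; i++)
        if (visits[i] == 0 && (left[i] != NIL || right[i] != NIL))
            return false;                    /* unused nodes carry no links */
    return true;
}

/* bound[f][i][j] is set iff some valid instance has field f of node i
   (f: 0 = left, 1 = right) equal to j, where j == NIL stands for NULL. */
void compute_bounds(bool bound[2][SCOPE][SCOPE + 1]) {
    int fields[2 * SCOPE];                   /* left[0..], then right[0..] */
    long total = 1;
    for (int i = 0; i < 2 * SCOPE; i++) total *= SCOPE + 1;
    memset(bound, 0, sizeof(bool) * 2 * SCOPE * (SCOPE + 1));
    for (long code = 0; code < total; code++) {
        long c = code;                       /* decode one candidate heap */
        for (int i = 0; i < 2 * SCOPE; i++) {
            fields[i] = (int)(c % (SCOPE + 1));
            c /= SCOPE + 1;
        }
        const int *left = fields, *right = fields + SCOPE;
        if (!is_tree(left, right)) continue; /* keep valid instances only */
        for (int i = 0; i < SCOPE; i++) {    /* union of the field relations */
            bound[0][i][left[i]] = true;
            bound[1][i][right[i]] = true;
        }
    }
}
```

After calling `compute_bounds`, pairs such as a self-loop on any node, or any pointer back to the root, are absent from the bound, because no valid instance contains them.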

Fig. 3: Two valid binary trees.

By *scope* we mean the limit in the number of objects, ranges for numerical types, and maximum depth in loop unwinding, as in [17, 12]. An important assumption we make for analysis is that structure invariants do not refer to the specific heap addresses of data objects, and in particular that these do not use pointer arithmetic. Therefore, permuting data object identifiers on a valid instance still yields a valid instance (i.e., permuting the actual locations of data objects in the heap is irrelevant for invariant satisfaction). This is usually the case, and is indeed the case in all the examples that we will present in Section 5. This is an important assumption because it enables us to add an additional implicit invariant: *symmetry breaking*. This has an important impact on the size of tight field bounds, since they get greatly reduced when isomorphic structures are removed. We use a symmetry breaking procedure that removes all symmetries. For details, we refer the reader to [14, 13].
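As an illustration of what a symmetry breaking constraint can look like, the sketch below checks whether the node identifiers of a binary tree follow a breadth-first numbering (0, 1, 2, ... with the left child before the right one). Choosing breadth-first order as the canonical form is an assumption made here for illustration; the actual predicates are those described in [14, 13]. The names `is_bfs_canonical`, `MAX_NODES` and `NIL` are hypothetical.

```c
#include <stdbool.h>

#define MAX_NODES 16     /* hypothetical upper limit for this sketch */
#define NIL (-1)         /* index used to represent NULL */

/* Returns true iff a breadth-first traversal from node 0 meets the node
   identifiers in the order 0, 1, 2, ... (left child before right child).
   Among all relabelings of one tree shape, exactly one passes this check. */
bool is_bfs_canonical(const int left[], const int right[], int n) {
    int queue[MAX_NODES];
    int head = 0, tail = 0;
    if (n <= 0 || n > MAX_NODES) return false;
    queue[tail++] = 0;
    while (head < tail) {
        int node = queue[head];
        if (node != head) return false;      /* identifier out of order */
        head++;
        int kids[2] = { left[node], right[node] };
        for (int c = 0; c < 2; c++) {
            if (kids[c] == NIL) continue;
            if (tail == n) return false;     /* more links than nodes */
            queue[tail++] = kids[c];
        }
    }
    return true;
}
```

With this check, the tree whose relations are left = {⟨N0, N1⟩, ⟨N1, N3⟩} and right = {⟨N0, N2⟩, ⟨N1, N4⟩, ⟨N2, N5⟩} is accepted, whereas the isomorphic tree obtained by swapping the identifiers N1 and N2 is rejected.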

# **4 A Technique for Nondeterministic Generation of Dynamic Structures**

We are now ready to describe the technique for nondeterministic generation of dynamic structures, used to verify programs handling such data using CBMC. The technique requires:


The first three are necessary information; for the last one we present later on in the paper an algorithm to compute tight bounds, from the other three.

The technique starts by building a routine nondet\_T(), that produces and returns structures of type T. The routine works as follows. First, for every (pointer) type Tt involved (including T), we start by allocating n (the scope) data objects:

```
Tt *tt_nodes = malloc(sizeof(Tt) * n);
```
Then, for every structure pointer type Ts (for which we already allocated n data objects) and field f of type Tt in Ts, we build the following nondeterministic assignment:

```
ts_nodes[0]->f = NULL;
if (nondet_bool()) ts_nodes[0]->f = tt_nodes[0];
else if (nondet_bool()) ts_nodes[0]->f = tt_nodes[1];
...
ts_nodes[1]->f = NULL;
if (nondet_bool()) ts_nodes[1]->f = tt_nodes[0];
else if (nondet_bool()) ts_nodes[1]->f = tt_nodes[1];
...
```
Finally, nondet\_T() ends by returning either NULL or t\_nodes[0] (no other non-null node is necessary, due to symmetry breaking). Using nondet\_T(), we build the following verification harness for p:

```
T x = nondet_T();
__CPROVER_assume(repok(x));
p(x);
__CPROVER_assert(repok(x));
```
Of course the last assertion can be replaced by any expected property of p.
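The harness pattern above can be tried out without a model checker. The following sketch is a plain-C emulation under stated assumptions: the structure is a singly linked list with scope 3, `nondet_bool` reads bits from a choice word, and a driver loop that enumerates all choice words stands in for the exhaustive exploration that CBMC performs. The names `nondet_list`, `remove_first`, `explore` and `choice_bits` are hypothetical, and the `if (!repok(x)) continue;` line plays the role of `__CPROVER_assume`.

```c
#include <stdbool.h>
#include <stddef.h>

#define SCOPE 3

typedef struct node { struct node *next; } node;

/* Stand-in for nondeterministic choice: bits come from a choice word
   that the driver below enumerates exhaustively. */
unsigned choice_bits;
bool nondet_bool(void) {
    bool b = choice_bits & 1u;
    choice_bits >>= 1;
    return b;
}

/* Generic generation routine: each next field is NULL or any pool node. */
node *nondet_list(node pool[SCOPE]) {
    for (int i = 0; i < SCOPE; i++) {
        pool[i].next = NULL;
        for (int j = 0; j < SCOPE; j++)      /* the if / else-if cascade */
            if (nondet_bool()) { pool[i].next = &pool[j]; break; }
    }
    return nondet_bool() ? &pool[0] : NULL;
}

/* Invariant: the list reaches NULL in at most SCOPE hops (acyclic). */
bool repok(const node *root) {
    int steps = 0;
    for (const node *p = root; p != NULL; p = p->next)
        if (++steps > SCOPE) return false;
    return true;
}

/* Routine under analysis: detach and drop the first node. */
node *remove_first(node *root) {
    if (root == NULL) return NULL;
    node *rest = root->next;
    root->next = NULL;
    return rest;
}

/* Returns the number of executions that pass the assumption;
   *violations counts those in which the final check fails. */
int explore(int *violations) {
    int assumed = 0;
    *violations = 0;
    for (unsigned w = 0; w < (1u << (SCOPE * SCOPE + 1)); w++) {
        node pool[SCOPE];
        choice_bits = w;
        node *x = nondet_list(pool);
        if (!repok(x)) continue;             /* assume(repok(x)) */
        assumed++;
        x = remove_first(x);
        if (!repok(x)) (*violations)++;      /* assert(repok(x)) */
    }
    return assumed;
}
```

Most of the enumerated executions are discarded by the assumption, which is the source of inefficiency that field bounds are meant to reduce.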

We now turn our attention to the use of tight field bounds to reduce nondeterminism in nondet\_T(). For every structure Ts and field f with type Tt declared in Ts, if ⟨N<sup>Ts</sup><sub>i</sub>, N<sup>Tt</sup><sub>j</sub>⟩ does not belong to the tight bound B<sub>f</sub>, then we remove from nondet\_T() the line:

if (nondet\_bool()) ts\_nodes[i]->f = tt\_nodes[j];

To illustrate the benefits of using tight field bounds in this setting, compare the two (semantically equivalent) nondet\_avl() methods in Figure 4 for building AVLs with size at most 4. At the left of Figure 4, we show the code for the approach that considers all the feasible assignments to nodes' fields within the scope (many assignments not displayed due to the lack of space). With precomputed tight field bounds we can discard a significant number of these assignments, that are not allowed due to the bounds, as shown at the right of Figure 4. Notice that, among many others, all self-loops in nodes are discarded by the bounds.
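A rough way to quantify the reduction is to count the field-value combinations that each routine in Figure 4 can produce for a non-null root. The sketch below does this arithmetic; the per-assignment option counts are read off the figure, under the assumption that in the unbounded version the fields of n[1], n[2] and n[3] offer the same number of alternatives as those of n[0]. The names `count_candidates`, `unbounded_options` and `bounded_options` are hypothetical.

```c
#include <stddef.h>

/* Number of field-value combinations a generation routine can produce,
   given how many alternatives each (node, field) assignment offers. */
unsigned long long count_candidates(const unsigned options[], size_t n) {
    unsigned long long total = 1;
    for (size_t i = 0; i < n; i++) total *= options[i];
    return total;
}

/* Alternatives per assignment, in the order left, right, height for
   n[0], n[1], n[2], n[3]. NULL counts as one alternative. */
const unsigned unbounded_options[12] = {5, 5, 4,  5, 5, 4,  5, 5, 4,  5, 5, 4};
const unsigned bounded_options[12]   = {2, 3, 3,  2, 2, 2,  2, 2, 2,  1, 1, 1};
```

Under these assumptions the unbounded routine spans 100<sup>4</sup> = 100,000,000 combinations, while the bounded one spans 18 · 8 · 8 · 1 = 1,152; in both cases the invalid ones still have to be filtered out by the assumed invariant.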

#### **4.1 Computing Tight Field Bounds**

For the rest of this section we assume a fixed structure T, with fields f<sub>1</sub>,...,f<sub>m</sub> and representation invariant repok, and a fixed scope k. Tight field bounds for T can be automatically computed from assumed properties such as invariants and preconditions. These properties must be expressed in a language amenable to automated analysis, reducible to SAT-based analysis in our case. We employ the automated translation of the definition of T and its repok to a propositional formula implemented in the TACO tool [14, 13]. We also assume a symmetry breaking predicate is created by this translation, forcing canonical orderings of heap nodes in structures (see [14, 13] for a careful description of how these symmetry-breaking predicates are automatically built). We discuss below the parts of the translation that are important for the understanding of our approach, and refer the reader to the literature for additional details [14, 13].

```
avlnode* nondet_avl() {
  avlnode *n = malloc(sizeof(avlnode)*4);
  if (nondet_bool())
    return NULL;
  avlnode *result = n[0];
  // assignments to n[0]'s fields
  n[0]->left = NULL;
  if (nondet_bool())
    n[0]->left = n[0];
  else if (nondet_bool())
    n[0]->left = n[1];
  else if (nondet_bool())
    n[0]->left = n[2];
  else if (nondet_bool())
    n[0]->left = n[3];
  n[0]->right = NULL;
  if (nondet_bool())
    n[0]->right = n[0];
  else if (nondet_bool())
    n[0]->right = n[1];
  else if (nondet_bool())
    n[0]->right = n[2];
  else if (nondet_bool())
    n[0]->right = n[3];
  n[0]->height = 0;
  if (nondet_bool())
    n[0]->height = 1;
  else if (nondet_bool())
    n[0]->height = 2;
  else if (nondet_bool())
    n[0]->height = 3;
  // assignments to n[1], n[2] and n[3]'s
  // fields follow a similar pattern to
  // n[0]'s and are omitted
  return result;
}

avlnode* nondet_avl() {
  avlnode *n = malloc(sizeof(avlnode)*4);
  if (nondet_bool()) return NULL;
  avlnode *result = n[0];
  // assignments to n[0]'s fields
  n[0]->left = NULL;
  if (nondet_bool())
    n[0]->left = n[1];
  n[0]->right = NULL;
  if (nondet_bool())
    n[0]->right = n[1];
  else if (nondet_bool())
    n[0]->right = n[2];
  n[0]->height = 1;
  if (nondet_bool())
    n[0]->height = 2;
  else if (nondet_bool())
    n[0]->height = 3;
  // assignments to n[1]'s fields
  n[1]->left = NULL;
  if (nondet_bool())
    n[1]->left = n[3];
  n[1]->right = NULL;
  if (nondet_bool())
    n[1]->right = n[3];
  n[1]->height = 1;
  if (nondet_bool())
    n[1]->height = 2;
  // assignments to n[2]'s fields
  n[2]->left = NULL;
  if (nondet_bool())
    n[2]->left = n[3];
  n[2]->right = NULL;
  if (nondet_bool())
    n[2]->right = n[3];
  n[2]->height = 1;
  if (nondet_bool())
    n[2]->height = 2;
  // assignments to n[3]'s fields
  n[3]->left = NULL;
  n[3]->right = NULL;
  n[3]->height = 1;
  return result;
}
```
Fig. 4: Building AVLs with size at most 4. Left (first listing): all feasible assignments to nodes' fields. Right (second listing): only assignments deemed feasible by tight field bounds.

Let f be a field of T with type T'. Let A = {a<sub>1</sub>,...,a<sub>k</sub>} and B = {b<sub>1</sub>,...,b<sub>k</sub>} be the identifiers for data objects of type T and T' within scope k, respectively. This bounded field is then a relation f ⊆ A × (B + *null*). The propositional encoding of f consists of boolean variables f<sub>i,j</sub>, 0 ≤ i, j < k, such that f<sub>i,j</sub> = *True* in an instance I if and only if the value of f for object a<sub>i</sub> is equal to object b<sub>j</sub> (i.e., a<sub>i</sub>->f = b<sub>j</sub>) in I (the original translation also has variables representing a<sub>i</sub>->f = null; we omit these here to simplify the presentation).

As an example, Figure 5 below shows the propositional variables representing all the feasible values of binary trees' left and right fields for scope 6, in tabular form. In the tables, object identifiers are named N<sub>i</sub> (0 ≤ i < 6), variables l<sub>i,j</sub> (0 ≤ i, j < 6) denote N<sub>i</sub>->left = N<sub>j</sub> (similarly, r<sub>i,j</sub> denote N<sub>i</sub>->right = N<sub>j</sub>).


Fig. 5: Propositional encodings of binary trees' left and right fields for a scope of 6

In this way, the binary tree at the left of Figure 3, whose relational representation is given in equation 1, is defined exactly by setting the following variables to true (and all the remaining variables to false):

$$\texttt{left} = \{l_{0,1}, l_{1,3}\}, \quad \texttt{right} = \{r_{0,2}, r_{1,4}, r_{2,5}\} \tag{2}$$
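The correspondence between pairs of a field relation and propositional variables can be sketched in a few lines of C. The numbering scheme (`var_index` mapping the variable for the pair (i, j) to `i*K + j`) and the helper names `encode_field` and `decode_target` are hypothetical and chosen only for illustration; the actual variable layout is whatever the translation tool produces.

```c
#include <stdbool.h>

#define K 6   /* scope used in Figure 5 */

/* Hypothetical numbering: the variable for pair (i, j) gets index i*K + j. */
int var_index(int i, int j) { return i * K + j; }

/* Builds the valuation of one field: the variable for each listed pair
   is set to true, all remaining variables to false. */
void encode_field(const int pairs[][2], int npairs, bool vars[K * K]) {
    for (int v = 0; v < K * K; v++) vars[v] = false;
    for (int p = 0; p < npairs; p++)
        vars[var_index(pairs[p][0], pairs[p][1])] = true;
}

/* Recovers the field value of object i from a valuation, or -1 when
   no variable for object i is true (the field is null). */
int decode_target(const bool vars[K * K], int i) {
    for (int j = 0; j < K; j++)
        if (vars[var_index(i, j)]) return j;
    return -1;
}
```

For instance, encoding the pairs (0, 1) and (1, 3) of the left field sets exactly the two variables corresponding to l<sub>0,1</sub> and l<sub>1,3</sub>, and decoding object 0 from the resulting valuation yields 1.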

As each propositional variable in the encoding of a field represents exactly the fact that a single pair of objects belongs to the field, in the following we will speak of these two notions (propositional variables and pairs of objects related by a field) interchangeably. In fact, as our approach operates with propositional formulas (needed for exploiting incremental SAT solving), the tight field bounds will be represented and computed in terms of propositional variables. It is straightforward to see that if variable f<sub>i,j</sub> belongs to the tight field bound for field f, then ⟨a<sub>i</sub>, b<sub>j</sub>⟩ is a feasible pair in the relational semantics (and is infeasible if f<sub>i,j</sub> does not belong to the tight field bound).

It is worth noticing that deciding if there exists a structure with a particular field value, say a<sub>i</sub>->f = b<sub>j</sub>, can be accomplished by querying the solver about the satisfiability of a formula consisting of a propositional encoding of the structure and the invariant (prop\_repok), the propositional encoding of the symmetry breaking predicate (prop\_sbpred), and the corresponding variable f<sub>i,j</sub>:

$$\texttt{prop\_repok} \land \texttt{prop\_sbpred} \land f_{i,j} \tag{3}$$

In case the satisfiability verdict is true, the valuation returned by the solver corresponds to a *valid* (in the sense that it satisfies the invariant) memory heap, containing pair ⟨a<sub>i</sub>, b<sub>j</sub>⟩ in the relational representation of f. Also, from the valuation we can retrieve for each field f all the (true) variables that represent pairs of objects related by f in that particular heap.

The formula above can be used to compute tight bounds, by determining which variables f<sub>i,j</sub> (and hence which pairs in the fields' semantics) are infeasible in states that satisfy the invariant. In [14], the infeasible variables are determined using a top-down algorithm. In the algorithm therein, the field semantics is initially set, for a field of type B declared in structure A, to A × (B ∪ {*null*}). From this fully populated initial semantics, each pair is checked for feasibility. Pairs found to be infeasible are removed from the bound. Adopting this top-down approach for computing tight field bounds leads to a large number of feasibility checks that are *independent* from one another, thus making it amenable to distributed processing. Moreover, a pair can be removed from the bound as soon as it is deemed infeasible, which can be exploited to compute tight field bounds "non-exhaustively", e.g., dedicating a certain time to the computation of tight field bounds, and taking the obtained tight field bound for improving SAT analysis, regardless of whether the tight bound is the *tightest* (it converged to removing all infeasible pairs) or not. The latter can be achieved thanks to the fact that, in the top-down approach, intermediate bounds are also tight bounds [14, 13]. As each SAT query in this top-down approach is independent from the rest, the algorithm does not exploit the incremental capabilities of modern SAT solvers.
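The top-down scheme described above can be sketched in a few lines. The code is an assumption-laden illustration, not the implementation of [14]: the list of valid heaps of a toy linked list (scope 3) is written out by hand, and `is_feasible` merely looks a variable up in that list where the real algorithm would issue one SAT query of the form (3).

```python
# Canonical valid heaps of a toy list with scope 3, as sets of true next_{i,j} variables.
VALID_HEAPS = [set(), {(0, 1)}, {(0, 1), (1, 2)}]
ALL_VARS = [(i, j) for i in range(3) for j in range(3)]

def is_feasible(var):
    """Stand-in for one SAT query: is there a valid heap containing var?"""
    return any(var in heap for heap in VALID_HEAPS)

def top_down(all_vars, max_queries=None):
    """Start from the full bound and remove every pair found infeasible."""
    bound = set(all_vars)
    for done, var in enumerate(all_vars):
        if max_queries is not None and done >= max_queries:
            break                      # stop early: the bound is usable but not the tightest
        if not is_feasible(var):       # each check is independent of all the others
            bound.discard(var)
    return bound

tightest = top_down(ALL_VARS)
partial = top_down(ALL_VARS, max_queries=4)
print(sorted(tightest))        # [(0, 1), (1, 2)]
print(tightest <= partial)     # True
```

Stopping after four queries still leaves a bound that contains every feasible pair, which is why intermediate results of the top-down approach remain usable.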

Let us present our approach to compute tight field bounds. As opposed to the technique in [14], our algorithm operates in a *bottom-up* fashion. In our presentation below, we assume a propEncoding method that takes the repok, a symmetry breaking predicate sbpred, and the scopes scope, and returns an encoding object. Its getPropositionalFormula method creates and returns a CNF propositional formula, encoding the repok and sbpred for the given scope. Also, the encoding's getVars(f) method returns all the propositional variables in the encoding of field f (see Figure 5). The algorithm uses an *incremental* SAT solver, represented by a module solver, with the following routines:


The pseudocode of our algorithm is shown in Figure 6. Line 3 builds a propositional encoding using the repok, the symmetry breaking predicate sbpred and the scopes. The CNF propositional formula produced by the encoding object is then loaded into the solver in line 4. Lines 5-7 initialize sets vars_f<sub>1</sub>, ..., vars_f<sub>m</sub>, each containing all the propositional variables in the encoding of the corresponding field f<sub>1</sub>, ..., f<sub>m</sub>. As opposed to the top-down algorithm proposed in [14], which initialized fields' semantics as binary relations containing all pairs, the bottom-up algorithm starts with empty sets feasible_f<sub>1</sub>, ..., feasible_f<sub>m</sub> (lines 8-10). These sets are used by the algorithm to store partial bounds for the corresponding fields f<sub>1</sub>, ..., f<sub>m</sub>, and will be iteratively extended with the true variables in instances returned by the SAT solver.

A crucial step in our algorithm is performed at line 12, where the current formula loaded in the SAT solver is extended, exploiting incremental SAT solving [16], with a progress-ensuring constraint on heap fields. Here, we add a clause that consists of the disjunction of all the variables in the encoding of fields that have not been previously added to the feasible_f<sub>1</sub>, ..., feasible_f<sub>m</sub> sets. The purpose of this clause is to ensure that instances returned by solver.solve() in line 13 contain at least one pair that does not belong to the sets already held in feasible_f<sub>1</sub>, ..., feasible_f<sub>m</sub>. Intuitively, by adding the clause in line 12, the call to solver.solve() in line 13 can be interpreted as *"find a valid heap instance of the data structure that can be used to extend at least one of the current bounds in* feasible_f<sub>1</sub>, ..., feasible_f<sub>m</sub>*"*. If such an instance exists, it is returned by the solver.getModel() method, and stored in the model variable in line 14. The variables that are true in model are then added to the feasible_f<sub>1</sub>, ..., feasible_f<sub>m</sub> sets in lines 15-19. The loop terminates when feasible_f<sub>1</sub>, ..., feasible_f<sub>m</sub> cannot be augmented any further (lines 20, 21), in which case, as we prove below, these sets hold tight field bounds and are returned by the algorithm (line 24).

As an example, assume we are computing tight field bounds for binary trees, and that the invocation to solver.solve() returned the instance at the left of Figure 3. Then, the variables in sets left and right shown in equation 2 will be added to feasible_left and feasible_right, respectively, in lines 15-19. Notice that this forces an instance with at least one variable not in the left or right sets to be returned by solver.solve() in the next iteration.

It is worth noticing the importance of encoding the progress-ensuring constraint in line 12 as a clause. This is what enables the use of *incremental SAT solving* [16] in our tight bounds computation. Essentially, incremental SAT solvers allow one to append further constraints after each satisfying valuation is found, as long as these are in CNF. These constraints are conjoined with the main (CNF) formula, and used in computing the "next" satisfying instance without having to restart the solving process (which is very time-consuming). Also, this allows the solver to exploit the learned clauses (that summarize the conflicts found by the solver in the search of satisfying valuations) to help accelerate subsequent queries [10]. Notice that, if the new constraints were not in CNF, the whole resulting formula would have to be translated to CNF and the SAT process restarted from scratch.

Theorem 1 proves our algorithm terminates and computes tight field bounds.

#### **Theorem 1.** *Algorithm 6 terminates and returns valid tight field bounds.*

*Proof.* Termination easily follows from the following two facts: *(i)* for given bounds on data domains of the structure under analysis and limited by *scopes*, the number of pairs that can be added to a field bound is finite; and *(ii)* each while-loop iteration either adds at least one extra pair to the bounds, or otherwise returns *unsat*, in which case the loop terminates.

To prove that the algorithm yields tight field bounds, we proceed as follows. Notice that at each iteration, and for any field f<sub>i</sub>, the bound associated to field f<sub>i</sub> (feasible_f<sub>i</sub>) is a subset of the corresponding tight bound, i.e., contains only feasible variables: the initial bound (∅) is obviously a subset of the tight bound, and bounds are extended only by adding variables extracted from valid structures (i.e., each loop iteration produces a valid expansion). An inductive argument allows us to conclude that, on termination, the bound associated to field f<sub>i</sub> (feasible_f<sub>i</sub>) is a subset of the tight bound. We will now show that feasible_f<sub>i</sub> is the tight field bound. Let us suppose that, once the algorithm terminates, bound feasible_f<sub>i</sub> is not tight, i.e., there exists a feasible variable v<sub>w,z</sub> that does not belong to feasible_f<sub>i</sub>. Then, there must exist a canonical (i.e., satisfying symmetry breaking) instance I of repok within scopes, in which o<sub>w</sub>->f<sub>i</sub> = o<sub>z</sub>. Therefore, I satisfies repok, sbpred, and v<sub>w,z</sub> = True. Since v<sub>w,z</sub> is not in feasible_f<sub>i</sub>, it occurs in the last progress-ensuring clause added in line 12 (and in all earlier ones, whose variable sets are supersets of the last), so I satisfies the whole formula loaded in the solver, contradicting the fact that the algorithm had terminated with an UNSAT verdict. Therefore, all variables excluded from feasible_f<sub>i</sub> are infeasible, making this bound tight.

```
1  procedure bottom-up(repok, sbpred, scopes)
2  begin
3    encoding = propEncoding(repok, sbpred, scopes)
4    solver.load(encoding.getPropositionalFormula())
5    vars_f1 = encoding.getVars(f1)
6    ...
7    vars_fm = encoding.getVars(fm)
8    feasible_f1 = {}
9    ...
10   feasible_fm = {}
11   while True do
12     solver.addClause(⋁ { v | j ∈ 1..m, v ∈ vars_fj \ feasible_fj })
13     if solver.solve() = SAT then
14       model = solver.getModel()
15       feasible_f1 = feasible_f1 ∪
16         {v | v <- vars_f1 and model.getValue(v) = True}
17       ...
18       feasible_fm = feasible_fm ∪
19         {v | v <- vars_fm and model.getValue(v) = True}
20     else // UNSAT
21       break
22     fi
23   done
24   return feasible_f1, ..., feasible_fm
25 end
```
Fig. 6: Bottom-up algorithm for tight field bounds computation
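The bottom-up loop of Figure 6 can be exercised on a toy instance with the sketch below. It is not the authors' implementation: `ToySolver` is a hypothetical brute-force stand-in for an incremental SAT solver (a real one would keep its learned clauses between calls), and `valid_list` is an assumed predicate playing the role of the loaded encoding of repok and sbpred for a singly linked list with scope 3.

```python
from itertools import product

class ToySolver:
    """Brute-force stand-in for an incremental SAT solver (hypothetical API)."""

    def __init__(self, variables):
        self.variables = list(variables)
        self.base = lambda true_vars: True   # plays the role of the loaded CNF formula
        self.clauses = []                    # added clauses: disjunctions of variables
        self.model = None

    def load(self, predicate):
        self.base = predicate

    def add_clause(self, variables):
        self.clauses.append(frozenset(variables))

    def solve(self):
        for bits in product([False, True], repeat=len(self.variables)):
            true_vars = {v for v, b in zip(self.variables, bits) if b}
            if self.base(true_vars) and all(c & true_vars for c in self.clauses):
                self.model = true_vars
                return True
        self.model = None
        return False

    def get_model(self):
        return self.model


def bottom_up(solver, field_vars):
    """field_vars maps each field name to the set of its propositional variables."""
    feasible = {f: set() for f in field_vars}
    while True:
        # Progress-ensuring clause (line 12): some variable outside the bounds must hold.
        pending = set().union(*(field_vars[f] - feasible[f] for f in field_vars))
        solver.add_clause(pending)
        if not solver.solve():               # UNSAT: no bound can be extended
            break
        model = solver.get_model()
        for f in field_vars:                 # lines 15-19: absorb the true variables
            feasible[f] |= field_vars[f] & model
    return feasible


def valid_list(edges):
    """Toy repok plus symmetry breaking: node i may only point to i+1, no gaps."""
    targets = {j for _, j in edges}
    return all(j == i + 1 and (i == 0 or i in targets) for i, j in edges)


NEXT_VARS = {(i, j) for i in range(3) for j in range(3)}   # (i, j) stands for next_{i,j}
solver = ToySolver(sorted(NEXT_VARS))
solver.load(valid_list)
bounds = bottom_up(solver, {"next": NEXT_VARS})
print(sorted(bounds["next"]))   # [(0, 1), (1, 2)]
print(len(solver.clauses))      # 3
```

On this instance the loop needs three solver calls (two satisfiable, one unsatisfiable), whereas checking each of the nine variables separately would need nine.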

As opposed to the top-down algorithm for tight bounds introduced in [14, 13], Algorithm 6 only provides useful information once it terminates: intermediate bounds cannot be used to improve analysis. Moreover, whereas the top-down approach lends itself well to parallelization (as we mentioned before, it implies a large number of independent SAT queries that can be solved in a distributed manner), it is not obvious how one would reasonably distribute our new bottom-up computation. Nevertheless, as we will show in Section 5, the sequential Algorithm 6 and its optimizations (i.e., the usage of incremental SAT solving) are substantially faster than the parallel, distributed, top-down approach.

# **5 Evaluation**

Our first experimental evaluation assesses the impact of tight field bounds in verification of code handling linked structures using CBMC. The evaluation is based on a benchmark of collection implementations, previously used for tight field bounds computation in [14, 13], composed of data structures with increasingly complex invariants:


Experiments in this section were run on workstations with an Intel Core i7 4790 processor (8 MB cache, 3.6 GHz, 4 GHz Turbo) and 16 GB of RAM, running GNU/Linux. The incremental SAT solver used was MiniSat 2.2.0. We denote by OOM that the 16 GB of memory were exhausted, and by OOM+ that the 16 GB were exhausted while CBMC was preprocessing; in this latter case no numbers of clauses or variables were produced by CBMC. The timeout for these experiments was set to 1 hour.

Table 1 reports, for the most relevant routines of each of the data structures in our benchmark, the verification running times and, separately, the running times of the underlying decision procedure (both in seconds), as well as the number of clauses and variables (expressed in thousands) in the CNF formulas corresponding to each of the verification tasks, for several scopes (S). Since we checked whether the routines preserved the corresponding structure's invariant, we did not consider for the experiments those routines that did not modify the structure (these trivially preserve the invariant). We assessed three different approaches:


Some remarks on the results are in order. Table 1 shows that in all analyzed routines, the TFB approach allowed us to analyze larger scopes for which the other input generation techniques exhausted the allotted time or memory. TFB was able to analyze larger scopes than Gen&Filter in 7 out of 12 cases (remarkably, by at least 6 in AList, at least 3 in CList and at least 2 in AVL), and in 8 out of 12 cases with respect to Build\* (by at least 4 in all 8 cases). Routine extractMin in structure BHeap is particularly interesting: it contains a bug first found in [14] that can only be exhibited by an input with at least 13 nodes. Gray cells mark experiments in which the bug was detected by CBMC. Notice in particular that Build\* does not scale well enough to find this bug.

Our second evaluation is devoted to tight field bounds computation, in comparison with the top-down approach presented in [14]. For a fair comparison, we re-ran the TACO experiments as reported in [13] on the same hardware we used for our own experiments. Original scripts and configurations were preserved. All distributed experiments were run on a cluster of 9 PCs (one being the master) of the same characteristics as described above. Each distributed experiment was run 3 times; the reported timing is the average thereof. All times are given in wall-clock seconds. A timeout (TO) is set at 18,000 seconds (5 hours) for tight bounds computation. Our bottom-up tight field bounds technique is non-parallel, and was run on a single workstation. Table 2 summarizes the results of our experiments regarding tight bounds computation. We compared the running times of computing tight field bounds using the distributed technique from [14] and the non-parallel algorithm presented here, for scopes 10, 12, 15, 17 and 20, reporting the following:


Table 1: Dynamic data structure verification in CBMC: TFB versus Build\* and Gen&Filter. Verification and solving times in seconds, clauses and variables in thousands

Table 2: Tight field bounds computation times and achieved speed-ups.

The speed-ups obtained by Alg. 6 are, in comparison with the distributed approach in [14], in general very good. In particular, in all experiments but AVL with scope 20, the running time of our sequential bottom-up approach (BU) is already below the wall-clock time of (parallel) TACO. For AVL trees with scope 20, the only experiment where BU performed slower than TACO, the achieved speed up is 0.6X. This means that running BU on a single workstation does not even take twice as long as running TACO(||) on 32 processors (4 cores in 8 slave machines used for distributed computation). Second, it is worth noticing that structures with strong invariants (e.g., BHeap) intuitively lead to "small" tight field bounds; a bottom-up approach then, as we explained earlier, is particularly well suited for tight bounds computation for these structures, since the process of computing bounds by discovering and adding new elements to a partial bound until nothing new can be discovered, quickly converges to termination in these cases. Third, some structures with relatively weak invariants also had good running times (AList, in particular), when compared to other case studies. Although the invariants in these cases are weaker, which intuitively would lead to more expensive tight bounds computations, these structures have fewer fields, so the state space to be covered to compute tight bounds is significantly smaller than that of more complex structures.
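The claim about the AVL experiment with scope 20 can be checked with one line of arithmetic. The derivation assumes, as the 0.6X figure for a slower BU suggests, that the speed-up is the ratio of the TACO wall-clock time to the BU time; the symbols T_TACO and T_BU are introduced here only for illustration:

$$\text{speed-up} = \frac{T_{\text{TACO}}}{T_{\text{BU}}} = 0.6 \;\Longrightarrow\; T_{\text{BU}} = \frac{T_{\text{TACO}}}{0.6} \approx 1.67\, T_{\text{TACO}} < 2\, T_{\text{TACO}}$$

Under that reading, the single workstation takes about two thirds longer than the cluster, which used 8 × 4 = 32 cores for the same task.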

All the experiments in this section can be reproduced following the instructions available at [1].

*Threats to Validity.* Our experimental evaluation is limited to data structures. From the vast domain of data structures, we have selected a few that we consider representative for several reasons: they are often used as case studies in the evaluation of other software analysis tools [6, 9, 18, 28], their invariants have varied complexity (which is a dimension that affects tight bounds' size, and thus their computation), some are acyclic and others are not (which shows that the encoding we make in CBMC is quite general), etc. We consider this a good selection, representative of a wider class of data structures.

Our approach to capture both the Build\* and Gen&Filter strategies might have accidentally favored our technique. We tried different alternatives for capturing Build\* and Gen&Filter, in particular with different ways of writing the repOK routines (which affected running times). We took the best alternative found for each case, to perform the comparison. In the case of Build\*, we took the smallest number of builder routines that guaranteed producing *all* (bounded) structures, since this is a factor that impacts running times. All structures with the exception of CList and TSet required just the add routine, while these two also needed a remove routine, to guarantee generation of all structures.

Regarding variance across cluster runs, different schedulings indeed yield slightly different timings. Since the granularity of individual analyses is fine, differences are typically small. However, they grow with the scope (e.g., usually smaller than 5% for scope sizes below 10, but up to 15% for the largest sizes). We used the average of 3 runs to reduce the effect of variance in the experiments.

Finally, we did not prove our implementations correct, so our results may be affected by errors in our implementations. We checked consistency of the results across different techniques and tools to confirm that bounds were correctly computed, and verification was bounded complete in all cases.

# **6 Related Work**

Automated analysis of code handling dynamic data structures has been the focus of various lines of research, including separation logic based approaches [5], approaches based on combinations of testing and static analysis [22], various forms of model checking including explicit state model checking [27], symbolic execution based model checking [23] and SAT-based verification [14, 13]. The approach that we refer to as Build\*, producing nondeterministic structures by using insertion routines, has been used in some of these approaches, including [22, 11]. The "generate & filter" mechanism, on the other hand, is more often employed in modular (assume-guarantee) verification. In particular, the *lazy initialization* approach, whose symmetry breaking we borrowed for "generate & filter" in this paper, is used in [19], among others. However, in SAT-based bounded model checking, with tools such as [20], "generate & filter" is not reported as an analysis option for dynamic data structures. Tight bounds have been used previously to improve test generation and bounded verification for JML-annotated Java programs [14, 13]. The setting is however different from that of CBMC, due to the relational program (and heap state) semantics, which enabled them to exploit tight bounds directly at the propositional encoding level. Tight bounds have also been used for improving symbolic execution based model checking [15, 26]. Again, the context is different, since these approaches, which essentially "walk" the code (either concretely or symbolically), can exploit tight bounds more deeply [26], also obtaining greater profits.

We have also reported a novel technique to *compute* tight bounds. This algorithm is inspired by the work of [24] on black-box test input generation using SAT. Our work is also closely related to [14, 13]. The approach to compute tight field bounds presented in [14, 13] as part of the TACO tool performs a very large number of independent SAT queries to compute bounds, and thus requires a cluster of workstations to do so effectively (we compared with this approach in the paper). An alternative approach to compute tight field bounds is presented in [25], but it requires structure specifications to be provided in a Separation Logic flavor [21].

# **7 Conclusions**

We have investigated the use of tight field bounds in the context of SAT-based bounded model checking, more concretely, in (assume-guarantee) verification of C code, using CBMC. We showed that, in this context, and in particular in the verification of programs dealing with linked structures, an approach based on nondeterministically generating structures, and then "filtering out" ill-formed ones, can be more efficient than the more traditional approach of repeatedly using data structure builders, especially when tight bounds are exploited. We have performed a number of experiments that confirm that this alternative approach allows CBMC to consider larger input sizes as well as to detect bugs that could not be detected without using bounds.

Since the approach depends on precomputing tight field bounds, we have also studied this problem, providing a novel algorithm for tight field bound computation. Tight field bounds have proved useful for a number of different analyses, but computing them is costly, and previous field bound computation approaches that performed reasonably did so at the expense of relying on a cluster of workstations to perform the task, or were only applicable to a limited set of class invariants, expressible in separation logic. Thus, while tight field bounds proved to have a deep impact on the previously mentioned automated software analysis techniques, their use has been severely undermined by the necessity of a cluster of computers for their effective computation, or the availability of specifications in separation logic. The algorithm presented in this article allows one to compute tight field bounds on a single workstation more efficiently than the distributed approach on a cluster of 8 quad-core machines, and therefore makes tight field bounds computation both practical and worthwhile, as part of the above mentioned analyses.

# **References**


*6th International Conference, SAT 2003. Santa Margherita Ligure, Italy, May 5-8, 2003 Selected Revised Papers*, volume 2919 of *Lecture Notes in Computer Science*, pages 502–518. Springer, 2003.


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# Effects of Program Representation on Pointer Analyses — An Empirical Study

Jyoti Prakash (✉)<sup>1</sup>, Abhishek Tiwari<sup>2</sup>, and Christian Hammer<sup>1</sup>

> <sup>1</sup> University of Potsdam, Potsdam, Germany jyotiprakash1@acm.org, c.hammer@acm.org <sup>2</sup> National University of Singapore, Singapore, Singapore tiwari@comp.nus.edu.sg

**Abstract.** Static analysis frameworks, such as Soot and Wala, are used by researchers to prototype and compare program analyses. These frameworks vary on heap abstraction, modeling library classes, and underlying intermediate program representation (IR). Often, these variations pose a threat to the validity of the results as the implications of comparing the same analysis implementation in different frameworks are still unexplored. Earlier studies have focused on the precision, soundness, and recall of the algorithms implemented in these frameworks; however, little to no work has been done to evaluate the effects of program representation. In this work, we fill this gap and study the impact of program representation on pointer analysis. Unfortunately, existing metrics are insufficient for such a comparison due to their inability to isolate each aspect of the program representation. Therefore, we define two novel metrics that measure these analyses' precision after isolating the influence of class-hierarchy and intermediate representation. Our results establish that the minor differences in the class hierarchy and IR do not impact program analysis significantly. Besides, they reveal the sources of unsoundness that aid researchers in developing program analyses.

Keywords: Pointer Analysis, Java, Program Analysis, Empirical Studies

### 1 Introduction

Researchers have proposed various approaches to enhance the precision and soundness of static analyses [6, 9, 10, 14, 17, 26, 30, 31]. They use program analysis frameworks to prototype and evaluate their algorithms. A program analysis based on declarative specifications (an increasingly popular implementation paradigm) uses these frameworks to extract fundamental dataflow relations and feeds them as the ground facts to a Datalog engine.

Program analysis frameworks, primarily Soot and Wala, are being increasingly adopted in program analysis [11, 31, 40]. These frameworks provide APIs, which abstract internal program representation. However, program representation in these frameworks is heterogeneous in many aspects. A few of those are:


These factors influence the precision, scalability, and soundness of the analyses, and at the same time, impede a fair comparison of analyses. Earlier research (Späth et al. [29]) was concerned about the validity of results when comparing two analysis frameworks. Reif et al. consider the comparison of different frameworks "bogus" [21] at the outset. Although many earlier works have proposed techniques to enhance scalability and precision, little to no work was done on how program representation influences program analyses. As a result, a comparison of new analyses with existing analyses suffers from a threat to validity that might have been overlooked. In this work, we fill the gap with an empirical study of these aspects of program analysis frameworks.

We choose pointer analysis for this study. Pointer analysis computes the heap locations referred to by program variables and builds the foundation for many other analyses, such as alias analysis, type-state analysis, or program slicing. To evaluate intermediate representation and library modeling, we choose Doop, a *state-of-the-art* pointer analysis framework, and compare its analysis for different frontends. For the third aspect, heap modeling, we compare the pointer analysis of *Wala* (a *state-of-the-art* program analysis framework) with *Doop* using *Wala*'s frontend, i.e., leveraging the identical intermediate representation.

A challenging aspect of this work is that the existing notions of precision for pointer analysis were insufficient. The computation of these metrics does not isolate single aspects of pointer analysis but rather combines all effects. For example, the average points-to set size is influenced by all three of the aforementioned aspects. It is difficult to determine the effect of each aspect by only looking at the score. In this work, we counteract this problem by introducing metrics that isolate a particular aspect under study and nullify the effect of the others. Therefore, we define two novel metrics in section 3.1, one for measuring the effects of libraries to enable a fair comparison among frameworks. To the best of our knowledge, this is the first study that evaluates the impact of program representation on pointer analysis. Precisely, in this paper, we make the following contributions:


In summary, our empirical study dispels the threats to the validity of the results of existing works posed by these differences between frameworks. It also discovers novel sources of unsoundness and imprecision in existing frameworks, yielding suggestions that users and developers of these frameworks could incorporate into their analyses. Although we focus on pointer analysis in the paper, our results are, in principle, generalizable to many other static analyses, as the findings presented in this paper also hold for these. We have made the artifacts available at https://github.com/jpksh90/pointeval to facilitate reproduction.

### 2 Background and Motivation

The goal of pointer analysis is to determine which objects a variable may refer (point) to at runtime. A *points-to set* is a static approximation of this question, which maps variables to objects that are allocated on the heap (heap objects). More precisely, if V is the set of variables in a program, and H is the set of heap objects, then *points*-*to* : V → P(H). *points*-*to*(v) returns the set of heap objects in H referred to by v.
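The definition above amounts to a map from variables to sets of abstract heap objects. A minimal sketch, with made-up variables `x`, `y`, `z` and objects `o1`, `o2` that do not come from any program in this paper, could look as follows; `may_alias` is a hypothetical helper showing the kind of client query such a map supports.

```python
from typing import Dict, Set

# Hypothetical analysis result: variables x, y, z and abstract heap objects o1, o2.
POINTS_TO: Dict[str, Set[str]] = {
    "x": {"o1"},
    "y": {"o1", "o2"},
    "z": set(),
}

def points_to(var: str) -> Set[str]:
    """points-to : V -> P(H); variables without facts map to the empty set."""
    return POINTS_TO.get(var, set())

def may_alias(v1: str, v2: str) -> bool:
    """Two variables may alias if their points-to sets overlap."""
    return bool(points_to(v1) & points_to(v2))

print(may_alias("x", "y"))  # True
print(may_alias("x", "z"))  # False
```

Because the sets over-approximate runtime behavior, a `True` answer only means that aliasing cannot be ruled out statically.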

Doop is a framework that exclusively focuses on pointer analysis, defines the analysis' inference rules in Datalog [41], and is in active development. It supports tuning of the analysis to adapt for various factors of precision (and scalability). Doop leverages the program synthesizer Soufflé [12, 22] to resolve *points*-*to* according to the inference rules and the ground facts, which are derived directly from the program.

Wala [37] and Soot [28] are general-purpose program analyzers providing some pre-defined analyses and APIs for the development of custom analyses. Wala comes with various pre-defined pointer analyses [39], some of which feature novel optimizations to enhance scalability.

A *context-sensitive analysis* improves a pointer analysis' precision by discerning method calls based on their calling contexts. Popular notions of contexts are based on method callsites [23] (*callsite-sensitive*), invoking objects (*object-sensitive*) [19], or hybrids thereof [13].
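The effect of a calling context can be illustrated with a deliberately tiny sketch. It is not an analysis implementation: the function `analyze`, the call-site labels and the allocation-site label are hypothetical, and the sketch assumes that abstract heap objects are qualified by the calling context (a context-sensitive heap abstraction). It only shows how a factory method called from two call sites yields one merged abstract object without context and two distinct ones with a call-site context of depth 1.

```python
def analyze(calls, alloc_site, k):
    """Toy model of a factory method that allocates at alloc_site.

    calls: list of (result variable, call-site label) pairs.
    k: number of call sites kept as context (0 means context-insensitive).
    """
    result = {}
    for var, callsite in calls:
        context = (callsite,)[:k]               # truncate the call string to length k
        result[var] = {(alloc_site, context)}   # abstract object qualified by its context
    return result

calls = [("a", "main:3"), ("b", "main:4")]
insensitive = analyze(calls, "a@8", k=0)
sensitive = analyze(calls, "a@8", k=1)
print(insensitive["a"] == insensitive["b"])  # True
print(sensitive["a"] == sensitive["b"])      # False
```

With `k=0` both variables share the single abstract object of the allocation site and therefore appear to alias; with `k=1` the two call sites keep them apart.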

In the sequel, we explain the need for this study by exemplifying the three factors that influence the results of pointer analyses.

Listing 1.1: Factory Method

```
1 public class Factory {
2 public static void main(String args[]) {
3 AInt a = AInt.getInstance(5);
4 AInt b = AInt.getInstance(7); } }
5 class AInt {
6 private Integer a; // ... getter , setter and constructor
7 public static AInt getInstance(int x) {
8 return new AInt(x); //allocation a@8
9 }}
```
Listing 1.2: Soot IR for the *main* method in Listing 1.1

```
1 public class Factory extends java.lang.Object {
2 //constructor
3 public static void main(java.lang.String[]) {
4 java.lang.String[] r0;
5 AInt r1, r2;
6 r0 := @parameter0: java.lang.String[];
7 r1 = staticinvoke <AInt: AInt getInstance(int)>(5);
8 r2 = staticinvoke <AInt: AInt getInstance(int)>(7);
9 return; } }
```
#### 2.1 Intermediate Representation

Many program analysis tools leverage an *intermediate representation* (IR) instead of the actual source or bytecode for analysis. IRs remove syntactic sugar from the source code and make it amenable to analysis by focussing on the fundamental operations. Popular strategies for IR generation are based on three-address code or *static single assignment* (SSA) form [4]. By default, the Soot framework uses a three-address-based IR (*Jimple*) [35], while Wala uses an SSA-based IR [38]. Both IRs are register-based [36, 38], and hence introduce synthetic variables to mimic the stack-based Java bytecode. *Doop* can be configured to leverage either *Jimple* or *Wala*'s IR as a frontend for program representation.

Consider the code example in Listing 1.1 and its Jimple IR depicted in Listing 1.2. The *main* method declaration (line 2) translates to the almost identical line 3 in the IR, whose parameter is translated to the variable *@parameter0* (line 6). Due to the additional local variable *r0* (line 4), the single main method argument translates to two variables in the IR. The invocations of the static method *getInstance* (lines 3 and 4 of Listing 1.1) are translated to the corresponding operation code *staticinvoke* with the method name and arguments. The newly allocated objects returned from these factory method invocations are stored in the variables *r1* and *r2*.

Wala's IR generation differs significantly from Soot (see Listing 1.3). As an SSA-based IR, it does not assign names to method parameters and variables but ordinal numbers (starting from '*1*') called *variable numbers* (we prepend '*v*' to these numbers for clarity). Thus, the receiver object (*this* reference in Java), or the first parameter in the case of a static method, is (silently) assigned the number *v1*. Further method parameters are assigned subsequent variable numbers, succeeded by local variables. Again, the static method calls to the method *getInstance* are translated to *invokestatic*, where *v3* and *v6* hold the (implicitly defined) constant arguments 5 and 7. The objects returned from the factory method invocations are stored in the variables *v5* and *v8*. Potential exceptions thrown in the invoked methods are stored in *v4* or *v7*, respectively.

Listing 1.3: Wala IR for the *main* method in Listing 1.1

```
1 Factory.main([Ljava/lang/String;)V
2 5 = invokestatic < Application , LAInt ,
     getInstance(I)LAInt; > 3 @1 exception:4
3 8 = invokestatic < Application , LAInt ,
     getInstance(I)LAInt; > 6 @7 exception:7
4 return
```
Listing 1.4: Snapshot of pointer analysis results from Doop with different IR

```
1 // Variables in main method with Wala
2 < <<main method array>> <Factory: void
    main(java.lang.String[])>/v1
3 // Variables in main method with Soot
4 > <<main method array>> <Factory: void
    main(java.lang.String[])>/@parameter0
5 > <<main method array>> <Factory: void
    main(java.lang.String[])>/l0#_0
```

The differences in program representation influence the metrics of pointer analysis. We analyzed Listing 1.1 context-insensitively with Doop, using Jimple and Wala's IR. The results are shown in Listing 1.4: the main method parameter object «main method array» is referred to by one variable in Wala (line 2) but by two variables in Soot (lines 4–5). Even though the average points-to set size is 1 for all variables in Listing 1.4, we found noticeable differences in the average points-to set sizes in other programs' analyses: with Soot's frontend, the average points-to set size is 2.07 for 3328 variables, whereas with Wala's it is 1.95 for 2298 variables. Jimple again created more variables than Wala. These subtle differences in program representation affect the average points-to set size, and it is unclear whether these two numbers are in fact comparable. In this work, we aim to investigate the impact of IRs on the precision and scalability of the analysis (Section 4.3).

#### 2.2 Static modeling of libraries

As a whole-program analysis, a pointer analysis requires knowledge not only of the program to be analyzed but also of the library classes, especially those related to the runtime. For example, a whole-program analysis of a Java application would require the runtime libraries, such as those in *rt.jar*, and other dependent libraries bundled with the application. Analysis frameworks such as Soot and Wala construct the class hierarchy based on all classes present in libraries and the application. They can also remove "irrelevant" classes, favoring scalability over soundness. Interestingly, we found cases where some frontends do not load all of the required classes, which induces discrepancies when comparing the analyses.

Consider the program shown in Listing 1.1. To corroborate our intuition, we analyzed this program context-insensitively with Soot's and Wala's frontends. Using the former frontend, Doop loads 3,837 classes and computes the analysis with an average points-to set size of 2.07. With Wala's frontend, it loads 19,927 (~5×) classes for an analysis with an average points-to set size of 1.95. Further investigating the types of heap objects, we found that Doop with Wala's IR contains objects of the class *java.security.PrivilegedActionException*, which is absent in the analysis with Soot. Note that our simple program contains no instance of that type, so it must stem from analyzing libraries. In another instance, Soot loads the classes from *javax.crypto*, whereas Wala does not. In this research, we examine the imprecise modeling and discover possible implications on precision and soundness (Sections 4.1 and 4.2).

#### 2.3 Heap Abstraction

Heap abstraction is an important aspect of pointer analysis and determines how object allocations are statically represented in the analysis. One simple approach is to create a unique representation for each object allocation site in the program (*allocation site abstraction*). However, at runtime, allocation sites can be executed more than once, creating several objects that are then represented by the same abstract value. As an example, consider the object allocation (line 8) of Listing 1.1, represented via a single abstract object, say *a@8*. In the *main* method, the newly allocated objects returned by *getInstance* are captured by the variables *a* and *b*, which would both refer to the abstract object *a@8* in the result of the pointer analysis. Thus, *a* and *b* are spuriously considered *aliases* (i.e., referring to the same object). This imprecision stems from ignoring the calling context of *getInstance* (*context-insensitive heap abstraction*).

A *context-sensitive heap abstraction* (a.k.a. *heap cloning*) discerns the abstract<sup>3</sup> heap objects based on the calling context, associating the calling context with the heap object to distinguish the allocations as a pair ⟨*allocation site*, *call stack*⟩. Thus, the allocation at line 8 is represented as two heap objects, ⟨*a@8*, 3⟩ and ⟨*a@8*, 4⟩. Without loss of generality, the length of the call stack can be increased to any finite number, lest the analysis be undecidable. All *state-of-the-art* pointer analysis frameworks offer *context-sensitive heap abstraction* with a finite context length.
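The difference between the two abstractions can be sketched in a few lines of Python. This is an illustration only, not the implementation of any of the frameworks discussed: the helpers `abstract_object` and `may_alias` are hypothetical, and the allocation site *a@8* and the call sites at lines 3 and 4 follow the numbering of Listing 1.1 used above.

```python
from typing import Optional, Tuple

# Hypothetical model: an abstract heap object is an allocation site, optionally
# paired with the call site (a call stack of length one) that reached it.
AbstractObject = Tuple[str, Optional[int]]


def abstract_object(alloc_site: str, call_site: int, heap_context: bool) -> AbstractObject:
    """Return the abstract object for an allocation reached via call_site."""
    return (alloc_site, call_site if heap_context else None)


def may_alias(x: AbstractObject, y: AbstractObject) -> bool:
    """Two variables may alias if they refer to the same abstract object."""
    return x == y


# getInstance allocates at a@8 and is invoked from lines 3 and 4 of main.
a_ci = abstract_object("a@8", 3, heap_context=False)
b_ci = abstract_object("a@8", 4, heap_context=False)
a_cs = abstract_object("a@8", 3, heap_context=True)
b_cs = abstract_object("a@8", 4, heap_context=True)

print(may_alias(a_ci, b_ci))  # allocation-site abstraction: spurious alias
print(may_alias(a_cs, b_cs))  # heap cloning keeps the two objects apart
```

With the allocation-site abstraction both variables map to the same abstract object, whereas the one-call-site heap context yields two distinct pairs.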

The discussion above demonstrates how the choice of heap abstraction can (potentially) influence pointer analysis. Therefore, in this work, we study the frameworks' heap abstractions. We conducted a preliminary study to gain initial insights and to validate our intuition, and context-sensitively analyzed Listing 1.1 with a *one-call-site context-sensitivity* in Doop with Wala's IR, and the *one-call-site sensitive* analysis of the Wala framework. Both of these analyses use a context-sensitive heap abstraction with a context length of one. In spite of that, Wala creates 17 objects while Doop creates 133 objects (~7×). The average points-to set size varies between 1.55 for the analysis provided by Wala and 1.62 for Doop with Wala's IR<sup>4</sup>. Thus, we can see that even with the same level of sensitivity in heap abstraction (and IR), analysis results depend on the framework used. Manual inspection revealed that Wala selectively uses the context-sensitive heap abstraction, applying contextual heap abstraction only to non-library classes while treating the library's objects context-insensitively. Out of the 17 heap objects, Wala uses context-sensitivity for only 6 objects. In contrast, Doop leverages context-sensitivity for all heap objects, including the library's objects. These initial insights motivated us to analyze the influence of heap abstraction on precision and scalability in more detail in Section 4.4.

<sup>3</sup> In the sequel, we refer to abstract heap objects as heap objects for brevity.

To summarize, the parameters for program analysis such as IR (Section 2.1), static modeling of libraries (Section 2.2), and heap abstraction (Section 2.3) affect the precision and scalability of a pointer analysis. Based on initial insights, we analyze the influence of the mentioned parameters using different frameworks, frontends, and on a larger and diverse set of benchmark applications.

### 3 Methodology

#### 3.1 Metrics Used

The precision of a pointer analysis has been defined in numerous ways in the literature. Some of the available metrics for precision are the average size of the points-to sets, the number of call-graph edges, and the number of resolved virtual calls. These metrics are not clearly superior to one another but rather tailored to specific clients; for example, the last one is leveraged by compilers in the *devirtualization* of virtual method calls.

All of these metrics reflect how precisely the analysis computes the points-to sets (sets of heap objects referred to by a variable). For example, whether or not a virtual call can be resolved depends on the heap objects' types in the points-to set of the target variable. If there is only one type (or subtypes thereof that do not redefine the virtual method), then the virtual call is resolvable. Therefore, the precision of a client analysis depends on how precisely the points-to set for each variable in the program can be resolved, in other words, how low the value of the average points-to set size is. An average size close to one is considered the hallmark of pointer analysis [27].
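The resolvability condition above can be illustrated with a small Python sketch. It is a simplified model under stated assumptions: the class hierarchy is a single-inheritance parent map, and the class and method names (`AInt`, `SubAInt`, `Other`, `get`) as well as the helpers `resolve` and `is_devirtualizable` are hypothetical and not part of Soot, Wala, or Doop.

```python
from typing import Dict, Optional, Set


def resolve(cls: Optional[str], method: str,
            parents: Dict[str, str], declares: Dict[str, Set[str]]) -> Optional[str]:
    """Walk up the hierarchy to the class whose implementation would be invoked."""
    while cls is not None:
        if method in declares.get(cls, set()):
            return cls
        cls = parents.get(cls)
    return None


def is_devirtualizable(receiver_types: Set[str], method: str,
                       parents: Dict[str, str], declares: Dict[str, Set[str]]) -> bool:
    """A virtual call is resolvable if all receiver types dispatch to one target."""
    targets = {resolve(t, method, parents, declares) for t in receiver_types}
    return len(targets) == 1 and None not in targets


# Hypothetical hierarchy: SubAInt inherits get() from AInt without redefining it.
parents = {"AInt": "Object", "SubAInt": "AInt", "Other": "Object"}
declares = {"AInt": {"get"}, "Other": {"get"}}

print(is_devirtualizable({"AInt", "SubAInt"}, "get", parents, declares))
print(is_devirtualizable({"AInt", "Other"}, "get", parents, declares))
```

The first call site is resolvable because both receiver types dispatch to the same implementation. The second is not, which shows how a larger points-to set can directly cost a devirtualization client precision.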

Therefore, we leverage the widespread metric of average points-to set size for our evaluation, i.e., the ratio of the total sizes of the points-to sets to the total number of local variables [26,34]. It permits a client-agnostic comparison of the pointer analysis, which generalizes our evaluation results to any specific analysis. We refer to the average points-to set size as *precision* in this paper. Note that the actual precision of the analysis is inversely connected to the average points-to set size: a lower precision value (i.e., average points-to set size) implies a higher precision of the computed analysis result, as precise analyses aim at excluding unrealizable (at runtime) allocation sites from the points-to sets of variables.

<sup>4</sup> Note that due to context-sensitive analysis, the average points-to set size is better than that mentioned in Sections 2.1 and 2.2.
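As a minimal sketch of this metric, the following Python function computes the average points-to set size from a mapping of variables to points-to sets. The function name and the dictionary layout are assumptions made for illustration; the toy data is modeled on the entries of Listing 1.4 with shortened variable names.

```python
from typing import Dict, Set


def average_points_to_size(points_to: Dict[str, Set[str]]) -> float:
    """Total size of all points-to sets divided by the number of variables."""
    if not points_to:
        raise ValueError("no variables to average over")
    total = sum(len(objects) for objects in points_to.values())
    return total / len(points_to)


# Toy relations modeled on Listing 1.4 (variable names shortened).
wala = {"main/v1": {"<<main method array>>"}}
soot = {
    "main/@parameter0": {"<<main method array>>"},
    "main/l0#_0": {"<<main method array>>"},
}
# A hypothetical, less precise result: one variable refers to two objects.
imprecise = {"a": {"o1", "o2"}, "b": {"o1"}}

print(average_points_to_size(wala))
print(average_points_to_size(soot))
print(average_points_to_size(imprecise))
```

Both frontends yield the same average for this snippet although they report a different number of variables, while the less precise result has a larger average.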

An IR may create many synthetic variables, among other reasons for method parameters or for φ-nodes at control-flow joins in SSA form. For example, three-address code re-uses the same variable in assignments in the *if* and *else* blocks of a conditional. However, SSA-based IRs insert a synthetic variable in a φ-node at the control-flow join to select one of the distinct variables of the respective blocks. The presence of synthetic variables in IRs impedes the comparison of different analyses using the average points-to set size, as averages depend on the (unequal) number of variables. Therefore, we devise heuristics to establish comparability of our metrics for different IRs.

Another challenge in this work is inferring the impact of each analysis parameter on its precision. Computed at the end of the analysis, the average points-to set size loses information on the contribution of a particular aspect of pointer analysis. Therefore, we require a fine-grained metric to quantify the precision for each parameter. We propose two such techniques, one for the class hierarchy and the other for the intermediate representation.

Class Hierarchy The analysis of the program's class hierarchy builds the foundation for inferring relevant variables and heap allocations. However, each framework leverages a particular strategy to infer classes that contribute to the program's semantics. Adding irrelevant classes to the class hierarchy may result in an artificially precise analysis, as these classes add to the total number of variables (which will all be pointing to an empty set), thus potentially decreasing the average size of points-to sets. Some of these variables and heap allocations are not part of the actual code executed at runtime, but rather arise out of an imperfect model of the program analysis framework's frontend. Here, we study the variables and heap objects stemming from the additional classes exclusive to a framework.
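A short worked example shows why such extra variables can make an analysis look more precise than it is. All numbers below are hypothetical and chosen only for illustration; they are not taken from the evaluation.

```python
def average_size(total_heap_objects: int, num_variables: int) -> float:
    """Average points-to set size from aggregate counts."""
    if num_variables <= 0:
        raise ValueError("need at least one variable")
    return total_heap_objects / num_variables


# Hypothetical counts: variables of relevant classes and their points-to entries.
relevant_vars = 1000
relevant_objects = 2500
# Variables from irrelevant classes, all with empty points-to sets.
extra_empty_vars = 250

before = average_size(relevant_objects, relevant_vars)
after = average_size(relevant_objects, relevant_vars + extra_empty_vars)
print(before, after)
```

The points-to sets of the relevant variables are unchanged, yet the average drops because the denominator grows.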

We first instrument the *Doop* framework to log the class hierarchies and compare the class hierarchies obtained using Soot and Wala as frontends, which yields the classes exclusive to each of the frameworks. *CH soot* and *CH wala* denote the sets of classes in the class hierarchies of Soot and Wala, respectively. *CH common* = *CH soot* ∩ *CH wala* is the set of classes common to both frameworks. We define *CH-precision* in terms of the average points-to set size restricted to variables defined in methods of *CH common*.

Definition 1. CH-Precision (CP). Let $\mathcal{V}_f^c$ be the set of variables defined in methods of *CH common* for the frontend $f \in \{\mathit{soot}, \mathit{wala}\}$, and let $\mathcal{H}_f^c(v) = \{h \mid h \in \textit{points-to}(v)\}$ for $v \in \mathcal{V}_f^c$. CH-Precision is the ratio of the total size of the sets $\mathcal{H}_f^c(v)$ to the size of $\mathcal{V}_f^c$, i.e.,

$$CP_f = \frac{\sum_{v \in \mathcal{V}_f^c} |\mathcal{H}_f^c(v)|}{|\mathcal{V}_f^c|}$$

If an analysis does not contain any exclusive classes or all of their variables (and corresponding heap objects) belong to the types present in the set of exclusive classes, CH-precision equals the average points-to set size.
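Definition 1 can be sketched as follows. The data layout (a points-to map plus a map from each variable to the class declaring its method) and all class and variable names are assumptions for illustration; they do not reflect Doop's actual relation schema.

```python
from typing import Dict, Set


def ch_precision(points_to: Dict[str, Set[str]],
                 declaring_class: Dict[str, str],
                 ch_soot: Set[str], ch_wala: Set[str]) -> float:
    """Average points-to set size restricted to variables of common classes."""
    ch_common = ch_soot & ch_wala
    common_vars = [v for v in points_to if declaring_class.get(v) in ch_common]
    if not common_vars:
        raise ValueError("no variables defined in common classes")
    total = sum(len(points_to[v]) for v in common_vars)
    return total / len(common_vars)


# Hypothetical class hierarchies: Wala loads one class that Soot does not.
ch_soot = {"Factory", "AInt"}
ch_wala = {"Factory", "AInt", "sun.util.Extra"}

# Hypothetical points-to facts produced with the Wala frontend.
points_to = {
    "Factory.main/v1": {"h1"},
    "Factory.main/v5": {"h2"},
    "sun.util.Extra.init/v1": {"h3", "h4", "h5", "h6"},
}
declaring_class = {
    "Factory.main/v1": "Factory",
    "Factory.main/v5": "Factory",
    "sun.util.Extra.init/v1": "sun.util.Extra",
}

overall = sum(len(s) for s in points_to.values()) / len(points_to)
cp_wala = ch_precision(points_to, declaring_class, ch_soot, ch_wala)
print(overall, cp_wala)
```

In this toy input the variable of the exclusive class inflates the overall average, and restricting the computation to *CH common* removes that contribution.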

Intermediate Representation (IR) The choice of IR determines a program's representation but retains the program's semantics, particularly with respect to heap allocations. Thus, different IRs can differ in the number of variables but will not introduce additional heap objects (e.g., Listing 1.4). A fundamental difference between Soot's Jimple and Wala's SSA-based IR is that SSA creates unique variables for each variable definition, while three-address code does not. Rendering our precision metric comparable for structurally different IRs is challenging, as tracking which variables correspond to each other is technically involved and may not be unique. Therefore, we rely on a heuristic to determine comparable variables. We motivate the heuristic by considering two different IRs for the *main* method in Listing 1.1. Jimple (Listing 1.2) defines four variables, *r0*–*r2* and *@parameter0*, while Wala's IR (Listing 1.3) defines three variables: *v1* (implicit, not shown in the listing), *v5*, and *v8*.

Definition 2. *Defm* denotes the set of variables defined in a method: $\mathit{Defm}(m, ir) = \bigcup_{s_i \in S_{m,ir}} \mathit{def}(s_i)$, where $S_{m,ir}$ is the set of statements in method $m$ for $ir$, and $\mathit{def}(s_i)$ is the set of variables defined in $s_i$.

Definition 3. Interesting Method. A method $m$ is interesting if $|\mathit{Defm}(m, \mathit{wala})| \neq |\mathit{Defm}(m, \mathit{jimple})|$ and $m$ is defined in a class $C \in$ *CH common*, i.e., the number of variables defined in the method with the same signature varies across IRs. $M$ denotes the set of interesting methods.

To determine the set of interesting methods (M) we leverage the logs from pointer analyses and segregate the variables in the logs according to the declaring method (*m*). If the sizes of the corresponding sets differ for a method m, it is considered interesting. (M is confined to the set of methods defined in *CH common* to exclude the exclusive classes.) Subsequently, we determine the points-to relation for the variables in M.
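The selection of interesting methods can be sketched as below. The log format (pairs of method and variable), the helper names, and the entries for `Factory.getInstance` are hypothetical; only the variables listed for `Factory.main` follow the discussion of Listings 1.2 and 1.3 above.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def defm(records: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Group (method, variable) log records by declaring method."""
    per_method: Dict[str, Set[str]] = defaultdict(set)
    for method, variable in records:
        per_method[method].add(variable)
    return dict(per_method)


def interesting_methods(wala: Dict[str, Set[str]], jimple: Dict[str, Set[str]],
                        declaring_class: Dict[str, str],
                        ch_common: Set[str]) -> Set[str]:
    """Methods of common classes whose number of defined variables differs."""
    result = set()
    for method in wala.keys() & jimple.keys():
        if declaring_class.get(method) not in ch_common:
            continue  # skip methods of classes exclusive to one frontend
        if len(wala[method]) != len(jimple[method]):
            result.add(method)
    return result


wala_log = [("Factory.main", "v1"), ("Factory.main", "v5"), ("Factory.main", "v8"),
            ("Factory.getInstance", "v1"), ("Factory.getInstance", "v3")]
jimple_log = [("Factory.main", "r0"), ("Factory.main", "r1"), ("Factory.main", "r2"),
              ("Factory.main", "@parameter0"),
              ("Factory.getInstance", "i0"), ("Factory.getInstance", "r0")]
declaring_class = {"Factory.main": "Factory", "Factory.getInstance": "Factory"}
ch_common = {"Factory", "AInt"}

m = interesting_methods(defm(wala_log), defm(jimple_log), declaring_class, ch_common)
print(m)
```

Here `Factory.main` is interesting because Jimple defines four variables and Wala's IR defines three, while the hypothetical `Factory.getInstance` entries have equal counts and are filtered out.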

A simple average of the heap objects over the number of variables is insufficient for comparing the precision of the analysis between two IRs. Differences in class hierarchies and aliasing generate new variables, which makes the ratio incomparable if the heap objects are not the same. To circumvent this problem, we combine the average points-to set size with ideas from virtual call resolution. The number of virtual call sites in a program is identical irrespective of the differences in program representation (caused by aliasing and redundant variables). Therefore, we obtain a fair comparison if we restrict the average points-to set size to the target variables of virtual method calls. We define a new metric, *average devirtualized heap objects* ($H_v^f$), which is the ratio of the total size of the points-to sets of target variables at the virtual call sites to the number of virtual call sites.

Definition 4. Average devirtualized heap objects ($H_v^f$). For the set of virtual call sites $C$ in the IR of a framework $f$ and $V_{C,f}$ as the set of invoking variables at $C$, let $H_v = \textit{points-to}(v)$ be the set of heap objects referred to by $v \in V_{C,f}$. Average devirtualized heap objects is

$$H_v^f = \frac{\sum_{v \in V_{C,f}} |\textit{points-to}(v)|}{|C|}$$
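Definition 4 can be sketched in Python as follows. The representation of call sites (a map from call-site identifier to receiver variable) and all identifiers are assumptions made for this illustration; following the definition, each invoking variable is counted once even if it is the receiver at several call sites.

```python
from typing import Dict, Set


def avg_devirtualized_heap_objects(call_sites: Dict[str, str],
                                   points_to: Dict[str, Set[str]]) -> float:
    """Sum of points-to set sizes of receiver variables over the number of call sites."""
    if not call_sites:
        raise ValueError("no virtual call sites")
    receivers = set(call_sites.values())  # the set V_{C,f} of invoking variables
    total = sum(len(points_to.get(v, set())) for v in receivers)
    return total / len(call_sites)


# Hypothetical virtual call sites and their receiver variables.
call_sites = {"c1": "r1", "c2": "r2", "c3": "r1"}
points_to = {"r1": {"h1", "h2"}, "r2": {"h3"}, "unrelated": {"h4", "h5", "h6"}}

h_v = avg_devirtualized_heap_objects(call_sites, points_to)
print(h_v)
```

Variables that never act as a receiver of a virtual call, such as `unrelated` here, do not influence the score, which is what makes the metric robust against IR-specific synthetic variables.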

Based on the above discussion, we formulate and answer the following research questions:

RQ1. How does the class hierarchy vary with the benchmarks?

RQ2. How do differences in class hierarchies affect the precision of analyses?

RQ3. How does the choice of IR affect the precision of the analysis?

RQ4. How do the heap abstractions differ between pointer analysis frameworks?

### 4 Evaluation

We use Doop version *4.20.7-67* and *Wala* version *1.5.0*. For RQ1–RQ3, we invoked Doop with the following analysis options: 1-call-site-sensitive, 1-object-sensitive, 2-call-site-sensitive+heap, 2-object-sensitive+heap. Specific options used in our study for each research question are described in the respective sections. We use the DaCapo [2] (version 9.12-bach) benchmarks, a standardized suite of open-source Java applications, for our study.

#### 4.1 RQ1: Class hierarchy differences with benchmarks

We captured the class hierarchies considered by the analyses to determine the differences. We instrumented *Doop* to log the classes considered during a (context-insensitive) analysis, which yields the complete class hierarchy. In order to investigate whether the class hierarchy depends on the frontend, we performed this experiment with Soot and Wala as frontends<sup>5</sup>. Table 1 lists the differences in the class hierarchies using Soot and Wala. On average, Wala exclusively contains ~13,994 classes in its class hierarchy. The number of classes exclusive to Wala ranges from 12,524 (Xalan) to 16,707 (Tradebeans). Soot's class hierarchy on average contains 26 classes not present in Wala's, ranging from zero to 62.

In the case of PMD and H2, Soot's class hierarchy contains only a single additional class, and Jython has 2 additional classes. Eclipse, Lusearch, and Luindex contain 62, 53, and 53 additional classes, respectively. In the remaining cases, the class hierarchy in Soot is strictly a subset of Wala's. In the next RQ, we will study the impact of these additional classes on the precision and scalability of the analysis.
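The comparison of the logged class hierarchies amounts to simple set operations, sketched below. The function name and the class names are hypothetical placeholders rather than entries from the actual logs.

```python
from typing import Dict, Set


def compare_hierarchies(ch_soot: Set[str], ch_wala: Set[str]) -> Dict[str, Set[str]]:
    """Split two logged class hierarchies into common and exclusive classes."""
    return {
        "common": ch_soot & ch_wala,
        "soot_only": ch_soot - ch_wala,
        "wala_only": ch_wala - ch_soot,
    }


# Hypothetical logs of loaded classes for one benchmark.
ch_soot = {"Factory", "AInt", "org.example.internal.Stats"}
ch_wala = {"Factory", "AInt", "sun.util.resources.Bundle", "sun.nio.cs.Coder"}

diff = compare_hierarchies(ch_soot, ch_wala)
print(len(diff["soot_only"]), len(diff["wala_only"]))
print(ch_soot <= ch_wala)  # True only if Soot's hierarchy is a subset of Wala's
```

An empty `soot_only` set corresponds to the benchmarks for which Soot's class hierarchy is a subset of Wala's.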

<sup>5</sup> Note that Soot and Wala provide options to exclude certain classes from analysis (to, e.g., exclude library classes). For a fair comparison we ignore this feature and compute the whole class hierarchy including libraries.

Table 1: Difference in classes considered by Soot and Wala. The last two columns show the extra classes loaded by Soot and Wala, respectively.

#### 4.2 RQ2: Precision differences with class hierarchy

Study Setup We have used the *var-points-to* relation, which maps all variable-context pairs to their resolved pairs of heap object and context. We select those variables that originate from classes common to both frameworks (Section 4.1) and query their points-to information. We then compute the *CH-Precision* based on Definition 1.

Results Table 2 presents the results of the analysis (for one-callsite, one-object, and two-object context-sensitivity) for the objects and variables belonging to exclusive classes present in Wala (only non-zero values included). Note that the two-object sensitive analysis did not terminate for Eclipse and Jython; therefore, these are not presented in the table. For the one-callsite and one-object analyses, Table 2 shows that six out of eleven benchmarks contain variables that belong to the exclusive class hierarchy. The remaining benchmark applications show no differences in the number of variables and heap objects, despite the presence of additional classes. This demonstrates that the additional classes loaded by these frameworks have no influence on the precision for these benchmarks.

The third and fourth columns of Table 2 list the number of variables (in principle, variable-context pairs) and heap objects belonging to the set of exclusive classes, respectively. In all analyses, all but one benchmark have a higher average points-to set size for exclusive variables than the general average. Tradebeans only creates 3 additional heap objects with Wala's frontend; therefore, the analyses are almost identical for both frontends. The average points-to set sizes for exclusive classes for bigger benchmarks such as Eclipse and Jython are outliers, showing very high averages. Still, the contribution of exclusive classes' heap objects and variables is negligible compared with the total heap objects of these benchmarks.

The eighth and ninth columns depict the CH-precision and the original precision for the analyses. We observe that the CH-precision is slightly lower than the precision for all benchmarks but Tradebeans, which originates from the additional heap objects and variables. These primarily belong to internal libraries such as *sun.util* and *sun.util.resources* (discussed later).

Table 2: Differences in precision in the presence of additional objects in class hierarchy (Wala). HO denotes the sum of the number of heap objects in the *points-to* sets. *CPwala* is the precision score for variables in *CH common*.

Table 3: Differences in precision in the presence of additional objects in class hierarchy for Eclipse (Soot).

With the Soot frontend (Table 3), the *CH-Precision* differs from *Precision* only for the benchmark Eclipse; for the other benchmarks, the analysis does not contain any objects whose type belongs to the exclusive classes of the frontend. However, it is difficult to compare the precision of Soot vs. Wala on the CH-Precision score due to differing variable numbers for the same benchmark application.

Finding 1: *Differences in class-hierarchy negligibly impact the pointer analysis precision (and thus client analyses).*

*Soundness* In our observation, the Wala frontend takes the internal Java libraries into account. We find heap objects belonging to libraries such as *sun.nio.fs*, *sun.util.resources*, *sun.security*, and *sun.nio.cs*, which are internal libraries used by the JVM. Soot, on the other hand, does not model these libraries for analysis.

Comparing the class hierarchies of the analyses using Soot and Wala, we observed that the class hierarchy using Soot as frontend is a subset of Wala's for all benchmarks except Eclipse. This suggests that analyses with Soot are as sound as analyses with Wala for all benchmarks except Eclipse. Eclipse is a compelling case: its analysis using Soot contains heap objects and variables that belong to the internal libraries of Eclipse, such as *org.eclipse.core.internal.runtime.PerformanceStatsProcessor*, while the analyses with Wala do not report these objects. However, results from the analyses with Wala contain heap objects from internal libraries such as *sun.util.\**, which are not present using Soot. This shows that the class hierarchy model is unsound in both frontends, as both lack some of the classes loaded by these benchmark applications at runtime.

Table 4: Total (for each framework) and interesting (Section 4.3) methods M.

*Our study reveals that library modeling in both Soot and Wala is* unsound *even for (non-native) Java objects, shown by the presence of heap-objects belonging to the exclusive classes of Soot and Wala.*

#### 4.3 RQ3: Precision for IR varies with the framework

Study Setup The study setup is similar to Section 4.2. We use the application's var-points-to sets, i.e., the relation of variables and heap objects excluding the library objects. From the results of the three analysis sensitivities, we extract the set of interesting methods (M, Def. 3) and compute the average devirtualized heap objects score for the virtual calls in interesting methods. We use the Jimple IR (--no-ssa option in Doop), and Wala's IR (--wala-fact-gen option in Doop) for evaluation.

Results Table 4 reports the number of interesting methods and total methods resolved using both frontends. Note that the number of interesting methods is identical for both frameworks for the same type of context-sensitivity. The number of reachable methods in each analysis differs, just as the number of distinct method signatures discovered in each framework (columns Soot, Wala in 1-CS, 1-OS, 2-OS<sup>6</sup>). However, deriving a relationship between those is impossible, as analyses such as one-call-site and one-object are not comparable. In all cases, we observed that the majority (~90%) of the methods are interesting. Therefore, we cannot ignore the significance of this aspect.

<sup>6</sup> We excluded 2-CS for its large file sizes.

Table 5: Results for IR. Third and fifth columns are the number of heap objects. Fourth and sixth columns are the number of virtual calls. Last two columns lists

*Interesting methods are difficult to ignore because of their sheer presence in the benchmark applications.*

Table 6: Differences Soot IR vs. Wala IR for Xalan

Listing 1.5: Differences in types of heap objects created in both analyses

```
(Wala) sun.misc.URLClassPath$Loader
(Wala) java.util.zip.ZipError
(Soot) javax.xml.transform.FactoryFinder$ConfigurationError
```

Table 5 presents the differences in the average devirtualized heap objects for Jimple and Wala IR. Although the number of variables and abstract heap locations are dependent on the IR, we did not observe many differences between those when restricting ourselves to target variables of virtual method calls, which corresponds to our intuition. The differences in the $H_v^f$ values for both IRs are negligible except for three larger benchmarks, Jython, Eclipse, and Xalan. Overall, the values from Soot IR were smaller than those of Wala, implying that *devirtualization* in Soot is either slightly more precise or slightly less sound than in Wala. However, the differences are minor in the majority of the cases. In conclusion, the choice of IR shows little to no impact on the precision of pointer analysis. In the sequel, we describe one such case study where the difference in $H_v^f$ is approximately two, which is a significant figure compared to the others.

Finding 2: *IR has negligible impact on the precision of pointer analysis at least for the devirtualization client.*

*Case Study: Xalan* To further investigate the differences, we chose *Xalan* using a one-call-site analysis, as the $H_v^f$ values for Soot (7.45) and Wala (9.44) display the highest difference among all benchmarks. The number of heap objects in both cases differs significantly, with Soot having 43K heap objects and Wala having 55K heap objects for a comparable number of virtual calls (5,832 vs. 5,850).
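As a rough consistency check, the reported scores can be approximated from the rounded counts quoted above. Since 43K and 55K are rounded, the sketch below only checks that the ratios land close to the reported 7.45 and 9.44; it does not reproduce them exactly.

```python
# Rounded heap-object counts and virtual call counts quoted in the text.
soot_heap_objects, soot_virtual_calls = 43_000, 5_832
wala_heap_objects, wala_virtual_calls = 55_000, 5_850

# Average devirtualized heap objects: heap objects per virtual call site.
soot_ratio = soot_heap_objects / soot_virtual_calls
wala_ratio = wala_heap_objects / wala_virtual_calls

print(round(soot_ratio, 2), round(wala_ratio, 2))
print(round(wala_ratio - soot_ratio, 2))  # difference of roughly two
```

Both ratios fall within 0.1 of the reported values, and their difference is about two, in line with the discussion above.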

To examine the heap objects, we collected their class types. We observed that the types of some of these objects belong to the classes in *CH soot* ∖ *CH common* or *CH wala* ∖ *CH common*. Listing 1.5 depicts the differences in heap objects created by these frameworks.

We also discovered (potential) sources of imprecision and unsoundness in both analyses. Table 6 lists methods and exceptions missed by both the Soot and Wala frameworks. Note that these methods and exceptions belong to the common class hierarchy. We observed that Wala has more precise exception modeling than Soot. For other virtual method invocations, we compared the runtime call-graph to the static call-graph. In our observation, both Wala and Soot are unsound, as demonstrated by the absence of certain method calls in the call-graph for both analyses. In addition, Wala imprecisely includes xerces.xml.dtd.XMLDTDLoader() in its call-graph (which, at least in our experiments, was not executed at runtime).

*Apart from reflection, imprecise/unsound virtual call resolution also induces imprecision/unsoundness into the analysis.*

#### 4.4 RQ4: Heap abstractions in pointer analysis frameworks

In this section, we compare Doop's analysis using Wala's frontend with Wala's own analysis. We omit the comparison with the Soot framework as it leverages IRs different from Wala's and thus would not be comparable.

Study Setup We compare the one-call-site sensitive analysis with context-sensitive heap abstraction (unique heap objects for each call-site, heap cloning) available in the Wala framework with a one-call-site analysis with one-level heap abstraction in Doop, and set the time budget to 7 hours. Analyses with a higher level of call-site sensitivity were not scalable in the Wala framework and therefore, we do not leverage those. Other optimizations in Wala, such as the use of object-sensitivity only for collection objects, are not comparable to the object-sensitive analysis available in Doop. Therefore, we also choose to ignore them. To handle reflective calls in Wala, we use the option REFLECTIONS.FULL. In what follows, we present the results of our study. We first present the differences in the number of heap objects and, subsequently, delve into their implications.

Table 7: Number of Heap objects

*Differences in the heap objects* For evaluation, we extracted the heap objects created in Wala's and Doop's analyses and observed huge differences in the number of heap objects created. Intuitively, using the same level of heap-sensitivity (heap cloning) should create the same number of heap objects. However, in certain cases, the number of heap objects in Wala is ~14 times that in Doop (columns 2 and 3 in Table 7). (Note that Eclipse and Jython are elided, as the analyses did not terminate within the time budget owing to the large file size (~100GB).) Therefore, the heap abstractions of these analyses are not comparable, although superficially they look similar.

*Subtle optimizations also manifest as imprecise heap modeling, even though the abstractions look similar at the outset.*

To investigate this further, we compared the types of the heap objects. Our study shows that the set of types is not even consistent using the same frontend! In many cases, the number of types of objects analyzed by Wala is approximately four times that in Doop (columns 4 and 5 in Table 7). The differences in heap abstraction for application-level objects are the reason for this.

*Application level objects* Application-level objects are the heap objects created due to allocations within the program (rather than libraries). In three out of eleven benchmarks, we observe that Doop's analysis is lacking application-level classes that Wala reports. We found corresponding allocations on a manual inspection of the source code. For example, in *avrora*, the analysis in Wala allocates heap objects of *BRNE\_builder* [8], which are not present in Doop's. Similar cases can be found in *PMD* and *Xalan*. However, owing to the limitations of the program representation, we could not determine the precise reason for the unsoundness. Pointer analysis uses an IR based on a control flow graph (CFG) rather than source code. Being a lower-level representation of the program source code, the IR mangles variable names. Therefore, a one-to-one correspondence between the IR's variables and variables in source code is not trivial.

Finding 3: *Heap modeling is not similar even for allocations within the application scope. Wala handles application-level objects more precisely than Soot in our evaluation.*

### 5 Threats to Validity

Naturally, the technique used relies on the precise handling of reflection calls and other dynamic features of the language such as dynamic proxies. Other than that, handling of native calls could alleviate the unsoundness of the analyses. Analysis of native calls could infer the native objects in the JVM missed by the Soot framework. Here, we have used the TamiFlex framework for handling reflection calls. Other approaches have improved the reflection handling [10,15–18,25]. To convince ourselves, we experimented with one of the *state-of-the-art* techniques, i.e., reflection with matching substring resolution [10]. However, we did not find any significant differences in results. Another limitation of this study is the unsoundness from ignoring the native library calls in static analyses. A few of the sources of unsoundness discovered stem from the native calls. Recently, Fourtounis et al. [7] proposed a technique for resolving native calls in Java. However, at the time of writing this paper, the technique was not available. Further, our analysis in Section 4.3 is based on test cases, which may not reflect all possible executions of an application.

Our study also involves hours of manual evaluation, which can be subject to bias. To counteract it, we did a manual inspection of the source code, especially for the sources of unsoundness. We also reran the benchmark applications with valid inputs to confirm that the objects are actually allocated at runtime.

### 6 Related Work

*Pointer analysis tools* Pointer analysis has garnered significant interest in the last decades, focussing on scalability, precision, and soundness. The Doop system used in this paper results from years of research on declarative-style pointer analysis [1,3,10,24,26]. Similarly, the Wala framework was the result of an industrial project and, unlike Doop, follows an imperative paradigm. The underlying program representations come with many implicit assumptions. In this work, we study the effects of these assumptions on program analysis.

*Empirical studies on pointer analysis* Recent empirical studies focussed on the soundness limitations that dynamic language features impose on existing pointer analyses and call-graph construction, two closely related and mutually dependent static analyses. Dietrich et al. [5] proposed automated and manual techniques to generate unsoundness oracles to test static analysis. Sui et al. [32] present the causes of unsoundness in static analysis frameworks (Soot, Wala, and Doop) due to the dynamic features of languages. Reif et al. [21] conducted a comprehensive study of call-graph generation algorithms, focussed on features in Java 9, and exposed the problems in the *state-of-the-art*, especially related to method calls in the Java runtime. Our work is orthogonal: we evaluate the influence of program representation in static analysis frameworks on program analyses, as well as the unsoundness arising out of it. Our study is also extensible to Java 9.

Sui et al. [33] evaluated the *recall* of call-graph construction and presented how it impacts the algorithms in practice. Their evaluation exposes problems in the *state-of-the-art*, especially related to method calls in the Java runtime. Our unsoundness results concur with theirs. Here, we have focussed on program representation rather than the dynamic features of the language, which are hard for static analyzers to handle. Further, our work features two novel metrics, in addition to the standard *precision* and *recall*, to measure the impact of different aspects of program representation.

### 7 Conclusion

This paper reports the effects of program representation on program analysis. Our metrics make it possible to compare implementations that leverage different frontends. We find that differences in program representation have negligible impact on the precision of the pointer analysis. In addition, we discovered novel sources of unsoundness and imprecision in the program analysis. Our results also demonstrate that the promised heap abstractions are not similar in practice, even though they may appear so from a bird's-eye view. Since pointer analysis builds the foundation of many static analyses, we conjecture that the results generalize to these as well.

#### References

1. Antoniadis, T., Triantafyllou, K., Smaragdakis, Y.: Porting doop to soufflé: A tale of inter-engine portability for datalog-based analyses. In: Proceedings of the 6th ACM SIGPLAN International Workshop on State Of the Art in Program Analysis. pp. 25–30. SOAP 2017, ACM, New York, NY, USA (2017). https://doi.org/10.1145/3088515.3088522, http://doi.acm.org/10.1145/3088515.3088522


Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, Dagstuhl, Germany (2018). https://doi.org/10.4230/LIPIcs.ECOOP.2018.26, http://drops.dagstuhl.de/opus/volltexte/2018/9231


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Keeping Pace with the History of Evolving Runtime Models**

Lucas Sakizloglou, Matthias Barkowsky, and Holger Giese

Hasso Plattner Institute, University of Potsdam, Potsdam, Germany <name>.<surname>@hpi.de

**Abstract.** Structural runtime models provide a snapshot of the constituents of a system and their state. Capturing the history of runtime models, i.e., previous snapshots, has been shown to be useful for a number of aims. However, handling history at runtime poses important challenges to tool support. We present the InTempo tool, which is based on the Eclipse Modeling Framework and encodes runtime models as graphs. Key features of InTempo, such as the integration of temporal requirements into graph queries, the in-memory storage of the model, and a systematic method to contain the model's memory consumption, are intended to address issues which seemingly limit the available tool support. InTempo offers two operation modes which support both runtime and postmortem application scenarios.

**Keywords:** runtime models · time-awareness · temporal graph queries

#### **1 Introduction to InTempo**

A (structural) *Runtime Model* (RTM) provides a snapshot of the constituents of a system and their state [3]. RTMs are typically employed in the context of *Self-adaptive Systems* (SAS) [4], where a feedback loop adapts the system behavior at runtime in response to external or internal stimuli, the latter represented as model fragments in the RTM and detected via the execution of model queries.

Encoding an RTM as a graph enables detection via *graph queries*, which specify a sought (graph) pattern. Such an encoding conforms to a *metamodel* which restricts the structure of model instances and defines types of vertices, edges, and attributes. Formally, these concepts rely on *typed, attributed graph transformation* [6] where graphs are typed over a *type graph*.

Capturing the *history* of RTMs, i.e., previous snapshots, may be useful for a number of aims such as the detection of recurrent behavior or postmortem analysis [3,8]. However, handling history at runtime poses important challenges to tool support. Tools are required to enable the specification and timely execution of queries with *temporal requirements*, i.e., requirements on the evolution of patterns over multiple snapshots. Timely execution is crucial for SAS, where a loop may depend on query results before planning and performing adaptations.

Faced with these challenges, the available tool support is seemingly limited either by the lack of support for direct specification of temporal requirements in graph queries [5] or by the on-disk representation of the model [8,11] that introduces an overhead on execution times in runtime settings, e.g., in SAS.

© The Author(s) 2021

E. Guerra and M. Stoelinga (Eds.): FASE 2021, LNCS 12649, pp. 262–268, 2021.

https://doi.org/10.1007/978-3-030-71500-7\_13

We present the InTempo (Incremental queries with Temporal requirements) tool (available online at [13]) which is based on the eponymous querying scheme in [15] and aims at mitigating these limitations. InTempo introduces ITQL, a language for the specification of *temporal graph queries*, which allow for the expression of temporal requirements. The core functionality of InTempo executes a query over an in-memory RTM which captures information about previous snapshots, called *Runtime Model with History* (RTM<sup>H</sup>), and returns the pattern occurrences in the RTM<sup>H</sup> that satisfy the specified temporal requirements. InTempo is implemented in the Eclipse Modeling Framework (EMF) [7] and can be used either via the Eclipse user interface or via an API. The latter enables InTempo results to be utilized by other tools, e.g., a SAS feedback loop.

InTempo offers two operation modes intended for different application scenarios (see Figure 1 for an illustration). The *RTM<sup>H</sup>Analysis* (Section 2) constitutes the core functionality of InTempo and executes a user-specified ITQL query (in a file with *.itql* extension; required extensions are in parentheses in Figure 1) over a user-provided RTM<sup>H</sup>, i.e., a persisted instance of an EMF model (in the standard *xmi* format). This mode returns the query results for the given RTM<sup>H</sup>. Query results are kept in-between analyses and are updated by each RTM<sup>H</sup>Analysis, which is also known as *incremental (query) execution*. The RTM<sup>H</sup>Analysis is intended to be used in settings where query results can be further utilized at runtime. For instance, a SAS feedback loop may use InTempo to detect problems formulated as patterns, similarly to [9]. Subsequently, the query results may be utilized to plan adaptations which address these problems.

The *LogAnalysis* operation mode (Section 3) assumes that, instead of being captured by an RTM<sup>H</sup>, past and present data about the system have been captured in an event log. InTempo introduces E2P, a specification language that allows for the mapping of event types to corresponding modifications of model fragments, i.e., nodes, edges, and attributes. As input, LogAnalysis requires the ITQL query, the log (with comma-separated values), and the E2P mapping. It then processes the log and maintains an internal RTM<sup>H</sup> which it uses to perform RTM<sup>H</sup>Analysis upon every event. LogAnalysis is intended for postmortem scenarios. Thus, it returns the results that were valid after each RTM<sup>H</sup>Analysis, sorted by the log timestamps, which affords a global, yet detailed, view on the evolution of the system state.

InTempo is capable of containing the data accumulation in the RTM<sup>H</sup> by systematically discovering and discarding data that is obsolete with respect to a given timestamp, i.e., not relevant to future query executions—this capability is presented in detail in [15]. Note that an implicit requirement of both operation modes is that the metamodel of the analyzed system has been encoded as an EMF Ecore model and is available in Eclipse (gray input in Figure 1).

Fig. 1: InTempo Execution Modes and Exemplary Application Scenarios

**Exemplary Application** To demonstrate the features and operation of InTempo we rely on an example drawn from the case-study conducted in [15]. Based on real-world smart medical environments, the case-study envisions a Smart Healthcare System (SHS) where certain medical procedures are automated and performed by devices, such as a smart pump administering medicine or a sensor tracking patient data and diagnoses, as a clinician would otherwise do. Data collected from the SHS are aggregated and recorded in medical (event) logs.

Fig. 2: SHS Metamodel

InTempo requires a metamodel which has been instrumented such that all nodes have at least two attributes named cts and dts, which capture the time point of creation, respectively deletion, of the node in the system. As an example, see the metamodel of our SHS in Figure 2. Note that to encode cts and dts for edges in EMF, the respective edges would have to be modeled as nodes. Technically, an RTM<sup>H</sup> is an instance of such a metamodel. See G<sub>3</sub> in Figure 3 for an example based on the SHS metamodel: The RTM<sup>H</sup> reflects that a node of type Sensor that is attached to the patient with id=1 has been activated and thus has been added to the SHS at timestamp 3. The sensor status reflects that the patient has been diagnosed with sepsis. The value ∞ reflects that a dts for this node has not been set, i.e., the node is still present in the modeled system.
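The instrumentation described above can be sketched as follows. This is a minimal illustration in Python rather than EMF; the names `Node` and `present_at` are hypothetical and not part of InTempo. The half-open convention `cts <= t < dts` is an assumption made for the sketch, since the text does not state whether the deletion time point itself is included.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for an instrumented RTM^H node (not InTempo's API)."""
    type_name: str
    cts: float                     # creation timestamp
    dts: float = math.inf          # deletion timestamp; infinity means "not deleted yet"
    attrs: dict = field(default_factory=dict)

    def present_at(self, t: float) -> bool:
        # Assumption: a node exists from cts (inclusive) up to dts (exclusive).
        return self.cts <= t < self.dts


# The sensor from the example: added at timestamp 3, dts not set.
sensor = Node("Sensor", cts=3, attrs={"status": "sepsis"})
```

Representing an unset dts as infinity mirrors the ∞ value used in the example, so a presence check needs no special case for nodes that are still in the system.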

### **2 RTM<sup>H</sup>Analysis**

This section presents an exemplary query in ITQL which it then uses to demonstrate the RTM<sup>H</sup>Analysis. It concludes with technical details.

**InTempo Query Language (ITQL)** Formally, a temporal graph query q is characterized by a (graph) pattern p and an *application condition* ac, denoted q = (p, ac). A match m corresponds to an occurrence of p in the RTM<sup>H</sup>. In order for m to be valid, it must satisfy the ac. ITQL supports the formulation of ac in the *Metric Temporal Graph Logic* (MTGL) [10] which supports operators such as negation (¬), existential quantification (∃), conjunction (∧), and the *metric*, i.e., interval-based, temporal operators *until* (U<sub>I</sub>, where I is a time interval over ℝ<sub>0</sub><sup>+</sup>) and *since* (S<sub>I</sub>), as well as abbreviations such as *eventually*, i.e., ♦<sub>I</sub> ∃ n = *true* U<sub>I</sub> ∃ n, where n is a graph pattern and *true* is always satisfied. MTGL also supports the *nesting* of patterns to *bind* graph elements in outer conditions and relate them to inner (nested) conditions, i.e., elements common to two patterns n<sub>1</sub> and n<sub>2</sub> refer to the same element in the RTM<sup>H</sup>.

MTGL is able to express real-time properties such as *"every patient diagnosed with sepsis, must eventually within 5 time units be given the proper drug"* (adjusted from the medical guideline in [14]). In an RTM<sup>H</sup> of the SHS, InTempo can find violations of the property above by executing the ITQL query q<sub>1</sub> = (n<sub>1</sub>, κ), with κ the MTGL formula ¬(♦<sub>[0,5]</sub> ∃ n<sub>2</sub>) and n<sub>1</sub>, n<sub>2</sub> patterns representing a sepsis diagnosis and drug administration, respectively. The query searches for matches of n<sub>1</sub> in the RTM<sup>H</sup> that satisfy κ, i.e., for patients that, although diagnosed with sepsis, did not receive a drug within the designated time. In InTempo, each match is associated with a *temporal validity*, i.e., a set of time intervals for which, based on the overlap among the cts and dts of the matched elements and the interval for which ac is satisfied, the match is valid. ITQL also allows for the definition of OCL constraints [12] on sought patterns.

Fig. 3: Exemplary Medical Log and Corresponding RTM<sup>H</sup> G<sub>3</sub> and G<sub>9</sub>

**Output** The ITQL specification for the query q<sub>1</sub> is shown in Figure 4. Performing RTM<sup>H</sup>Analysis for the query q<sub>1</sub> on the RTM<sup>H</sup> G<sub>9</sub> of Figure 3 returns one match, since there is indeed no Pump attached to the SHS, i.e., a match for n<sub>2</sub>, within five time units after a Sensor was activated, i.e., a match for n<sub>1</sub> was found. The temporal validity interval [3, 4] is returned together with the match. The match, i.e., violation, is indeed valid only for that interval since after timestamp 4, a match for n<sub>2</sub> starts to exist within five time units of a match for n<sub>1</sub>. If the API of InTempo is used, the query returns the match of the n<sub>1</sub> pattern, i.e., the EMF objects, together with the temporal validity. If InTempo is used via the UI, it displays a message box in Eclipse with the following message: SHS@0[] Sensor@3[status=sepsis] [[3,4]]. Note that "@" precedes the cts of an object and values within square brackets are attributes of the object.

Fig. 4: Example query in ITQL
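For readers who want to post-process such messages, the textual format described above can be parsed with a few lines of Python. This is only a sketch based on the single example message shown in the text: the function names are hypothetical, InTempo itself exposes EMF objects via its API, and the handling of several comma-separated intervals or attributes is an assumption.

```python
import math
import re

# "<Type>@<cts>[<attr>=<value>,...]" as in "Sensor@3[status=sepsis]"
OBJECT = re.compile(r"(\w+)@(\d+)\[([^\]]*)\]")
# One "[lower,upper]" interval inside the outer validity brackets
INTERVAL = re.compile(r"\[([^\[\],]+),([^\[\],]+)\]")


def parse_bound(text):
    """Convert an interval bound to a number; the symbol for infinity is supported."""
    text = text.strip()
    return math.inf if text == "∞" else int(text)


def parse_match(line):
    """Split a message into matched objects and temporal validity intervals."""
    head, _, tail = line.partition(" [[")
    objects = []
    for name, cts, attrs in OBJECT.findall(head):
        pairs = dict(p.split("=", 1) for p in attrs.split(",") if p)
        objects.append((name, int(cts), pairs))
    validity = [(parse_bound(a), parse_bound(b))
                for a, b in INTERVAL.findall("[[" + tail)]
    return objects, validity


objects, validity = parse_match("SHS@0[] Sensor@3[status=sepsis] [[3,4]]")
```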

**Technical Details** For the execution of temporal graph queries, InTempo employs the operationalization framework presented in [15]. The framework supports the decomposition of a query into a suitable ordering of simpler sub-queries which is executed bottom-up. The outermost query computes the overall result. For pattern-matching, InTempo employs the *Story Diagram Interpreter* from [1] which uses heuristics shown to reduce the pattern-matching effort. InTempo provides an Xtext [2] editor for ITQL which supports completion suggestions for element types and validation of the query syntax.

#### **3 LogAnalysis**

This section demonstrates the LogAnalysis operation mode which assumes that data from past states have been captured as events in a log. InTempo offers the capability to process the system changes and, upon each change, obtain an updated RTM<sup>H</sup> which is then used internally to perform RTM<sup>H</sup>Analysis.

**Events-to-Patterns (E2P) Specification Language** The mapping of log events (which encapsulate system changes) to prescribed modifications on an RTM<sup>H</sup> is facilitated by E2P. An E2P specification consists of mappings between events and actions that should be performed on an RTM<sup>H</sup>. E2P supports five actions (formulated as verbs): *adds*, to add a node and optionally assign values to the added node's attributes; *adds-ref*, to add an edge between two nodes; *modifies*, to modify the attribute values of a node; *deletes* and *deletes-ref*, to delete a node, respectively an edge, from the RTM<sup>H</sup>. To accommodate linked data, E2P allows for the *indexing* of added nodes so that later events can refer to modifications that have been processed earlier. An example of an E2P mapping from an exemplary log in Figure 3 (left) to the corresponding elements of the SHS is shown in Figure 5. Note that edge types, e.g., OwnedPumps, are not depicted in Figure 3.

Fig. 5: E2P Example for SHS

As an example, the event drug administration from the medical log in Figure 3 corresponds to the following changes to the (internal) RTM<sup>H</sup>: a Pump is created; its attribute status is set to "drug" and its attribute id takes the value of the second field after the event name (expressed by the special ∗*p* token), i.e., the id field in the log of Figure 3. By default, the cts is set to the value of the event field that is next to the event name, i.e., the ts field in Figure 3. The *init* statement is used to initialize the RTM<sup>H</sup> and the cts of nodes within is set to zero. To increase the readability of specifications, an explicit assignment for the dts may be omitted: Unless there is an attribute assignment, the dts of all nodes is set to the maximum value supported.
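The effect of such a mapping can be sketched in Python. This is not E2P syntax and not InTempo code: the dictionary `MAPPING`, the functions `init_model`, `apply_event`, and `replay`, and the two log rows are illustrative assumptions, chosen to be consistent with the timestamps 3 and 9 used in the running example. Each row is assumed to have the layout event name, ts, id.

```python
import csv
import io

MAX_TS = float("inf")  # stand-in for "the maximum value supported"

# Hypothetical mapping: event name -> (node type to add, value of its status attribute)
MAPPING = {
    "sepsis diagnosis": ("Sensor", "sepsis"),
    "drug administration": ("Pump", "drug"),
}


def init_model():
    # Counterpart of the init statement: initial nodes get cts = 0.
    return [{"type": "SHS", "cts": 0, "dts": MAX_TS}]


def apply_event(model, row):
    # Assumed row layout: event name, ts, id.
    event, ts, ident = row[0], int(row[1]), row[2]
    node_type, status = MAPPING[event]
    # "adds": create the node; cts comes from the ts field, dts defaults to the maximum.
    model.append({"type": node_type, "cts": ts, "dts": MAX_TS,
                  "status": status, "id": ident})


def replay(log_text):
    """Process a comma-separated log and return the resulting model."""
    model = init_model()
    for row in csv.reader(io.StringIO(log_text)):
        if row:
            apply_event(model, row)
    return model


LOG = "sepsis diagnosis,3,1\ndrug administration,9,1\n"
model = replay(LOG)
```

In LogAnalysis, a query would be executed after each processed event; the sketch only covers the model maintenance part.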

**Output** LogAnalysis provides a view on the matches *per event timestamp*. Performing LogAnalysis on the query q<sub>1</sub> and the log of Figure 3 would return:

```
@3 SHS@0[] Sensor@3[status=sepsis] [[3,∞]]
@9 SHS@0[] Sensor@3[status=sepsis] [[3,4]]
```
First, the sepsis diagnosis event is processed, which makes the internal RTM<sup>H</sup> identical to G<sub>3</sub> in the same figure. The query is executed using RTM<sup>H</sup>Analysis and returns a match, i.e., violation, since at that moment a match for n<sub>2</sub> does not exist in the graph. The temporal validity is equal to [3,∞], i.e., the match is valid from time point 3 onward. Next, the drug administration event is processed, which leads to G<sub>9</sub>. The result of RTM<sup>H</sup>Analysis for G<sub>9</sub> is the same as the result described in Section 2.

**Technical Details** In LogAnalysis the query execution framework monitors the RTM<sup>H</sup> for changes and, upon every change, recomputes the matches. Previous matches are kept in-between executions and therefore the query is executed incrementally. Similarly to ITQL, E2P is supported by an Xtext editor that offers syntax validation and completion suggestions for element types.

# **4 Conclusion and Future Work**

We presented InTempo, an EMF tool which enables the specification and incremental execution of temporal graph queries over a runtime model with history. The latter can be either provided as input or obtained from an event log. InTempo stands out from relevant tools owing to the integration of temporal requirements into graph queries, the in-memory representation of the model, and the systematic measures to contain memory consumption despite the accumulation of temporal data. Moreover, InTempo offers input editors with features that aim at helping the user, e.g., syntax validation. In the future, besides streamlining InTempo, we plan to perform an extensive evaluation and comparisons with other tools. Moreover, we plan to explore the utilization of InTempo in self-adaptation scenarios where the history of the system is required.

# **References**



# SpecTest: Specification-Based Compiler Testing

Richard Schumi and Jun Sun

Singapore Management University, Singapore, Singapore {rschumi,junsun}@smu.edu.sg

Abstract. Compilers are error-prone due to their high complexity. They are relevant not only for general purpose programming languages, but also for many domain specific languages. Bugs in compilers can potentially put all programs at risk. It is thus crucial that compilers are systematically tested, if not verified. Recently, a number of efforts have been made to formalise and standardise programming language semantics, which can be applied to verify the correctness of the respective compilers. In this work, we present a novel specification-based testing method named SpecTest to better utilise these semantics for testing. By applying an executable semantics as test oracle, SpecTest can discover deep semantic errors in compilers. Compared to existing approaches, SpecTest is built upon a novel test coverage criterion called semantic coverage which brings together mutation testing and fuzzing to specifically target less tested language features. We apply SpecTest to systematically test two compilers, i.e., the Java compiler and the Solidity compiler. SpecTest improves the semantic coverage of both compilers considerably and reveals multiple previously unknown bugs.

Keywords: Mutation testing · Compiler testing · K framework · Formal semantics · Rare language features

#### 1 Introduction

Compilers must be thoroughly tested (if not verified) for multiple reasons. First, compilers are essential for the software ecosystem. Their correctness is a prerequisite for program correctness. That is, a compiler bug might propagate to all produced programs. Second, compilers are error-prone due to their high complexity. Their main functionality is to convert source code to executable machine code. They often provide additional features, like code optimisation or debug utilities. A variety of compilers has been written for countless languages. Modern compilers like GCC, javac, and LLVM are overwhelmingly complicated (e.g., GCC has more than 7M lines of code and OpenJDK has more than 11M [20]). Although some of them have been used for decades, they may still be buggy [54,55].

Recently, there have been numerous efforts on formalising and standardising programming language semantics, such as K-Java [24], C semantics [29], KJS [47], or KSolidity [34,44], which readily serve as a specification of the respective compilers. Usually, these executable semantics are accompanied by manually crafted unit tests. Such tests are however designed to test the semantics rather than the compliance of the compiler to the language semantics. In this work, we aim to better utilise these semantics by automatically generating test programs with a novel coverage criterion that facilitates systematic compiler testing.

Multiple approaches have been recently proposed to test compilers. Most of them successfully found compiler bugs. For instance, the EMI project discovered more than 1600 bugs in GCC and LLVM [53]. Another study has revealed bugs in the Java compiler by comparing different javac and JVM versions [27]. For the relatively new Solidity (smart contract) language, many crashes were found through fuzzing [28]. Moreover, bugs in compilers may be exploited by attackers. For example, prior to version 0.5.0, the Solidity compiler had an uninitialised storage pointer vulnerability that affected many smart contracts on Ethereum. A honey pot named *OpenAddressLottery* was designed to exploit this vulnerability and steal ether (i.e., digital money in Ethereum). There are hundreds or even thousands of programming languages according to different sources [30] and many new ones emerge every year. For example, various new general purpose or domain-specific languages have been developed recently, such as Rust, Kotlin, Solidity, and Move.

© The Author(s) 2021

E. Guerra and M. Stoelinga (Eds.): FASE 2021, LNCS 12649, pp. 269–291, 2021.

https://doi.org/10.1007/978-3-030-71500-7\_14

Compiler testing is an ongoing research field. Next, we briefly review existing approaches according to how they address the following two problems: the test generation problem and the oracle problem.


Existing compiler testing approaches solve the test generation problem mainly in two ways: by generating programs according to a grammar that specifies the syntax of a language [49,31,23], or by mutating existing seed programs [40,55,41]. For the former, due to the huge search space, additional selection criteria, such as standard code coverage criteria like statement coverage, must be applied to selectively generate test cases for compilers. For the latter, existing mutation strategies are often limited by the 'weak' oracles (as we will discuss shortly) employed by the approach, e.g., mutating to introduce 'dead' code. Generally, approaches which generate complicated syntax focus more on parsing errors than on errors in the semantics.

For the oracle problem, existing proposals mainly rely on three oracles. The first oracle only flags a test failure if the program cannot be compiled or leads to crashes [28]. The second oracle flags a test failure if certain algebraic properties are violated. For instance, the algebraic property adopted in the EMI approach [55] is that mutating unreachable code does not change the execution result. We remark that these two oracles are 'weak' as they are unable to detect simple semantic errors such as 3+4 = 8. The third, stronger oracle checks whether the output of a test program is consistent with a reference, which could be a second compiler (i.e., differential testing [45]), or an abstract specification like a state machine [35,36]. This oracle requires a reference, which is not always available. Furthermore, it is limited to bugs which result in inconsistencies between the compiled program and the reference.

Last but not least, existing approaches do not provide a good adequacy measurement for the progress of compiler testing. Often, measurements like code coverage are used as an indicator, but they require access to the compiler code, and achieving full code coverage is challenging.
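The difference between a 'weak' crash oracle and a reference-based oracle can be made concrete with a small Python sketch. All names here are hypothetical stand-ins: `buggy_add` plays the role of code produced by a faulty compiler, and `reference_add` plays the role of a trusted reference such as a second compiler or a specification.

```python
def buggy_add(a, b):
    """Stand-in for compiled code with a semantic bug: 3 + 4 yields 8."""
    return 8 if (a, b) == (3, 4) else a + b


def reference_add(a, b):
    """Stand-in for a trusted reference implementation."""
    return a + b


def crash_oracle(program, inputs):
    # Weak oracle: flags a failure only if execution raises an exception.
    try:
        program(*inputs)
        return "pass"
    except Exception:
        return "fail"


def reference_oracle(program, reference, inputs):
    # Stronger oracle: flags a failure if the output differs from the reference.
    return "pass" if program(*inputs) == reference(*inputs) else "fail"


weak_verdict = crash_oracle(buggy_add, (3, 4))
strong_verdict = reference_oracle(buggy_add, reference_add, (3, 4))
```

The crash oracle accepts the faulty result because nothing crashes, whereas the reference oracle reports the mismatch.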

In this work, we present a novel specification-based testing method called SpecTest for compiler testing. SpecTest differs from existing approaches in the following aspects. First, SpecTest is built upon a strong oracle, i.e., an executable language specification that can predict the expected output of test programs. This strong oracle enables us to detect semantic errors, i.e., bugs that are related to the semantics. Such bugs may also originate from the runtime environment. Hence, SpecTest is not limited to classical compiler bugs. Second, SpecTest offers a testing adequacy measurement in terms of semantic coverage and has a built-in mutation-based test case generation method which aims to achieve high semantic coverage. The semantic coverage measures the number of language semantic rules that are covered by existing test cases. The test case generation method mutates the seed programs to maximise the coverage of the language semantics, e.g., by introducing less-tested language features into these programs. Compared to measuring the code coverage of a compiler, our semantic coverage has the added value that it does not need access to the compiler code, and it specifically targets semantic bugs.

Given a language semantics (in the form of a set of small-step operational semantic rules), SpecTest executes fully automatically. We have implemented SpecTest for two compilers, i.e., the Java compiler and the Solidity compiler, and tested the language features that are supported by our applied semantics [24,44]. The results of the evaluation were promising. SpecTest successfully increased the semantic coverage for both compilers and identified many bugs and issues, which helped the compiler and specification developers.

To sum up, we make the following technical contributions.


The paper is structured as follows. Sect. 2 explains our method and discusses the required components in detail. In Sect. 3, we present our evaluation with two compilers. Next, we review related work in Sect. 4 and conclude in Sect. 5.

#### 2 Method

In this section, we outline how SpecTest works. In particular, we present its high-level design, highlight relevant details of its components, and explain the workflow step by step using an example.

#### 2.1 Overall Design

The overall workflow of SpecTest is depicted in Fig. 1. In the following, we introduce the tasks briefly before diving into the details of the main components.

(1) A set of user-provided seed programs is given as input to a program fuzzer one by one, which generates a set of test inputs for each program with the intention to cover as many program paths as possible. A program and the associated test inputs form a test case that is the basis for the next phase, the test execution and evaluation.

(2) The program is compiled with the compiler and executed with the test inputs generated by the fuzzer. The final state (i.e., variable valuations) is obtained as the program execution result.

(3) An executable language semantics is executed with the same program and the same inputs, through firing a set of structural operational semantic (SOS) rules. The final state is obtained as the semantic execution result. During the semantic execution, we monitor how frequently each SOS rule is fired in order to identify rarely fired rules.

(4) The results of the program and semantic execution are compared in order to assess whether the program (built by a compiler) produces an output which is consistent with the language semantics. If the results are inconsistent, the test case is flagged as a failure. The failure may be due to a bug either in the compiler (or the execution environment of the program, e.g., JVM) or in the language semantics.

(5) We rank the SOS rules according to the number of times they are fired and identify the ones which are least fired. Each SOS rule is typically associated with one language feature and thus we are able to systematically identify language features which are least tested. With this information, a program mutator mutates the seed programs so that the corresponding language features are introduced systematically into the programs. In contrast to classical mutation testing [33], which ensures the quality of test suites, we apply mutations to generate more and better test cases.

(6) We then repeat from step (1), and the process continues until a user-specified timeout is triggered. The output of SpecTest includes a set of passed/failed test cases as well as a report on the semantic coverage, i.e., the number of times each SOS rule is fired.

Fig. 1: Overview of the data flow of SpecTest

Note that there are three main components in SpecTest, i.e., the executable program semantics which serves as oracle, the program fuzzer, and the program mutator. We present details of these components in the following.
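The interplay of the three components in steps (1) to (6) can be sketched as a toy loop in Python. This is not SpecTest's implementation: the expression language, the two rules `add` and `mul`, the fixed inputs returned by `fuzz`, the deliberately injected bug in `run_compiled`, and the bounded number of rounds (instead of a timeout) are all assumptions made for illustration.

```python
from collections import Counter

RULES = ("add", "mul")  # toy "semantic rules"; programs are nested tuples


def run_semantics(prog, x, fired):
    """Reference evaluation (the oracle); records every rule that fires."""
    if prog == "x":
        return x
    if isinstance(prog, int):
        return prog
    op, left, right = prog
    a, b = run_semantics(left, x, fired), run_semantics(right, x, fired)
    fired[op] += 1
    return a + b if op == "add" else a * b


def run_compiled(prog, x):
    """Stand-in for the compiled program, with an injected bug in 'mul'."""
    if prog == "x":
        return x
    if isinstance(prog, int):
        return prog
    op, left, right = prog
    a, b = run_compiled(left, x), run_compiled(right, x)
    if op == "add":
        return a + b
    return a * b + (1 if b == 2 else 0)  # deliberate bug when b == 2


def fuzz(prog):
    return [0, 1, 2]  # placeholder for a real input fuzzer


def mutate(prog, rule):
    return (rule, prog, "x")  # introduce the least-fired feature


def spectest_loop(seeds, rounds=2):
    fired, failures = Counter({r: 0 for r in RULES}), []
    programs = list(seeds)
    for _ in range(rounds):
        for prog in programs:
            for x in fuzz(prog):
                if run_compiled(prog, x) != run_semantics(prog, x, fired):
                    failures.append((prog, x))
        least = min(RULES, key=lambda r: fired[r])  # rank rules by firing count
        programs = [mutate(p, least) for p in programs]
    return fired, failures


fired, failures = spectest_loop([("add", "x", 1)])
```

In this toy run the seed program only exercises `add`, so the mutator introduces `mul` in the second round, and the comparison against the oracle then exposes the injected bug.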

#### 2.2 The Oracle

The oracle is an executable semantics of the programming language. That is, the oracle encodes the language semantics in the form of small-step SOS rules. Given a program (and necessary inputs for the program), the oracle is capable of executing the program according to the language semantics to produce the expected output, without going through the compiler to be tested.

Creating an executable semantics for a programming language is not trivial. It requires experience as well as effort. Nonetheless, it is desirable to have one because it provides a reliable way to check the correctness of compilers, and it will save time and effort in the long term since it effectively reveals ambiguities, inconsistencies and incompleteness. Many researchers have realised the importance of executable language semantics and have built foundations that we can work with, like the K framework [50], Redex [37], or Ott [51]. There are already executable semantics for many programming languages, like C, Java, JavaScript, or Solidity, which represent a strong oracle for compiler testing.

```
// Integer addition, delegated to the built-in +Int
rule I1 + I2 => I1 +Int I2
// Conditional with a true condition reduces to the then-branch
rule if ( true ) S else _ => S
// Storage allocation for a global non-array variable
rule [Allocate-Global-NonArrayType]:
<k> #allocate(N, CN, #varInfo(X:Id, E:Value, T:NonArrayType, #storage, L)) =>
     . ...</k>
<account>
   <acctID> N </acctID> <contractName> CN </contractName>
   <acctEnv> CONTEXT:Map => CONTEXT [X <- #storedVar(Slot +Int 1, T, #storage,
         1)] </acctEnv>
   <acctStorage> STORAGE:Map => STORAGE [Slot +Int 1 <- E] </acctStorage>
   <acctSlots> Slot => Slot +Int 1 </acctSlots> ...
</account>
```
Fig. 2: Example SOS rules for Solidity [44]

It is conceivable and in fact confirmed by our experiments that the oracle itself can be buggy due to human errors in encoding the language semantics or due to ambiguity in the language semantics in the first place. However, even a potentially buggy executable semantics is much better than none for the following reasons. First, during the above-mentioned process, SpecTest is able to identify bugs in the oracle, which helps to improve the language semantics. Second, bugs in the semantics are overall less likely compared to compiler bugs since the compiler must not only implement the semantics but also handle sophisticated code optimisations, which are known to be error-prone.

In this work, we apply the K framework [50] as a basis for our oracle. The K framework provides convenient notations for defining language semantics or type systems based on rewriting rules, configurations, and computations. It comes with a range of supporting tools, like a parser, an interpreter, or a program verifier, which enable the execution of the specifications. In short, it combines the functionality of both the compiler and the runtime environment. Encoding small-step SOS rules in the K framework is relatively straightforward. For example, Fig. 2 shows three (simplified) rules defined for programs in Solidity (i.e., a language for programming smart contracts). In particular, the first rule shows how simple addition should behave for Integers, given the existing K construct for addition, +Int. The second example is a rule for an if conditional statement, where the condition is true and the result is the then-branch. Not all rules are simple though. The third example is a rule for the storage allocation of a global non-array variable. In general, the rules become more complex for sophisticated language features such as concurrency or higher-order functions.

In this work, we adopt and extend the K semantics for Java [24] and Solidity [34,44] to implement SpecTest. The K semantics for Solidity, called KSolidity, currently has 304 rules. The K semantics for Java, called K-Java, has 1385 rules. K-Java was developed for an earlier version of Java (1.4) and some rules are deprecated or unreachable. Our extension of these existing efforts concerned mainly two aspects: extending them with proper interfaces and conversions so that they work with the other components in SpecTest, and introducing a measurement feature for semantic coverage. For example, we enhanced the coverage engine of the old K version for K-Java, and we added a visualisation of the covered rules.

Given a test case (in the form of a program with inputs), the executable semantics is used as follows. First, the test case is executed using the built-in execution engine of the K framework which fires the SOS rules one by one. The final variable valuations are captured as the result of the test case. For instance, for Solidity, we capture all the persistent states in the blockchain network (which includes addresses, their balances and the values of storage variables). This testing result is turned into an assertion in the test case. The test case with the assertion is then executed using the compiled program. If the assertion fails (e.g., the value of at least one variable is different), a bug is revealed.
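
The comparison between the oracle's final valuations and those observed from the compiled program can be sketched as follows. This is a minimal illustration under the assumption that both results are available as maps from variable name to a printed value; `OracleCheck` and `mismatches` are hypothetical names and do not describe how SpecTest generates its assertions.

```java
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper: compares the oracle's expected state with the observed state.
class OracleCheck {
    // Returns one entry per variable whose value differs (or is missing on one side).
    static Map<String, String> mismatches(Map<String, String> expected,
                                          Map<String, String> actual) {
        Map<String, String> diff = new TreeMap<>();
        Set<String> names = new TreeSet<>(expected.keySet());
        names.addAll(actual.keySet());
        for (String name : names) {
            String e = expected.get(name);
            String a = actual.get(name);
            if (!Objects.equals(e, a)) {
                diff.put(name, "expected " + e + " but got " + a);
            }
        }
        return diff;
    }

    public static void main(String[] args) {
        // Illustrative values only.
        Map<String, String> oracle = new TreeMap<>();
        oracle.put("a", "6");
        oracle.put("owner", "0x0");
        Map<String, String> compiled = new TreeMap<>();
        compiled.put("a", "9");
        compiled.put("owner", "0x0");
        // A non-empty result corresponds to a failing assertion, i.e., a flagged test case.
        System.out.println(mismatches(oracle, compiled));
    }
}
```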

Simply applying the above-mentioned steps to test compilers would not be comprehensive. That is, existing seed programs often use a limited set of common language features and thus would not be able to test the compiler extensively. In fact, our experience with testing the Solidity compiler with existing smart contracts suggests that many smart contracts are suspiciously similar. As a result, the test cases would only exercise a limited set of semantic rules and thus would miss bugs in the parts of the Solidity compiler that encode the remaining semantic rules. While collecting a large set of seed programs would likely be helpful, the larger question is whether there can be a quantitative measurement of the comprehensiveness of the test cases, and whether we can use this measurement to guide the generation of new test cases. SpecTest's answer to this question lies in the design of the mutator and the fuzzer.

#### 2.3 The Mutator

Due to the high complexity of modern compilers, it is important that a meaningful coverage criterion is applied for compiler testing. Existing approaches either are not concerned with coverage or they use coverage criteria which are not ideal for compiler testing. Hence, we introduce our novel semantic coverage.

Definition 1. *Given that* R *is the set of all semantic rules of our specification,* T *is the set of our given test programs,* I<sup>t</sup> *is the set of all possible inputs for the test program* t ∈ T*, and* cover(t, i, r) *is a predicate that is true when the test program* t*, executed with the test input* i ∈ I<sup>t</sup>*, fires the semantic rule* r *of our specification; our semantic coverage can be defined as follows:*

∀r ∈ R : ∃t ∈ T : ∃i ∈ I<sup>t</sup> : cover(t, i, r)

This means that to achieve semantic coverage (or at least increase it), it is not only important that we have good test programs, but also the test inputs for these programs are essential. In order to produce good test programs, we apply our mutations that inject language features to specifically target the uncovered rules as we will explain in detail in the following. The coverage of all rules r ∈ R would give us full semantic coverage, but in reality this is often infeasible, hence we also depict it as the percentage of rules that are covered.
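
The percentage form of the criterion can be sketched as below, assuming that each executed pair of test program and input yields the set of rule names it fired. The class `SemanticCoverage` and its method are hypothetical and only illustrate Definition 1.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: share of semantic rules fired by at least one (program, input) run.
class SemanticCoverage {
    // allRules is R; firedPerRun holds, for each executed (t, i), the rules r with cover(t, i, r).
    static double coveragePercent(Set<String> allRules, List<Set<String>> firedPerRun) {
        if (allRules.isEmpty()) {
            throw new IllegalArgumentException("the rule set must not be empty");
        }
        Set<String> covered = new HashSet<>();
        for (Set<String> fired : firedPerRun) {
            covered.addAll(fired);
        }
        covered.retainAll(allRules); // ignore names that are not part of the specification
        return 100.0 * covered.size() / allRules.size();
    }

    public static void main(String[] args) {
        Set<String> rules = new HashSet<>(Arrays.asList("r1", "r2", "r3", "r4"));
        List<Set<String>> runs = new ArrayList<>();
        runs.add(new HashSet<>(Arrays.asList("r1")));
        runs.add(new HashSet<>(Arrays.asList("r1", "r3")));
        System.out.println(coveragePercent(rules, runs)); // 50.0
    }
}
```

Full semantic coverage in the sense of the definition corresponds to a result of 100.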

In SpecTest, we achieve high semantic coverage with the following two synergistic parts. First, we design and implement a mutator which systematically introduces less-exercised language features into the test programs automatically. Second, we design and apply powerful fuzzing techniques to generate program inputs to exercise all statements including the less-used features in the test programs. The latter can be achieved with fuzzers optimised for existing code coverage criteria such as branch or statement coverage.

We believe that a comprehensive test suite for a compiler must cover all relevant aspects of the language semantics, and semantic coverage offers such a measurement. The above definition simply measures whether a rule is fired or not. It might be meaningful to further measure the context in which each SOS rule is fired (as certain bugs might only be triggered when a rule is fired in a certain context), which we leave as future work.

To achieve high semantic coverage, SpecTest employs a two-part solution. Given the oracle's feedback on which SOS rules are not fired (or least fired), the language features which are associated with these SOS rules are identified. This is straightforward, as each SOS rule is associated with a specific language construct. For instance, when the first rule of Fig. 2 is not fired, this indicates that our test programs contain no addition between Integer variables. Next, the mutator takes this information and systematically mutates the seed programs to introduce these less-tested language constructs.

The mutator is a code mutation engine which is designed to automatically mutate a given source program to generate new programs (i.e., test cases for the compiler). Existing mutation approaches [38,41,55] for compiler testing already applied mutators to generate test programs, but they mutate based on simple algebraic rules and are not systematic. For instance, equivalence modulo inputs (EMI) [41] works by injecting code into seed programs with the aim to achieve a high difference in the control- and data-flow compared to the original seed program in order to produce diverse test programs. In comparison, our mutator is designed to maximise semantic coverage.

Implementing the mutator is not trivial. For SpecTest, the mutators for Solidity and Java were implemented based on existing parsers through code instrumentation. That is, given a language feature and a source program, the mutator first parses the source program to build an AST. Afterwards, it identifies potential locations in the AST for introducing the features. Lastly, it systematically applies a mutation strategy specifically designed for the language constructs to inject them at all possible or specific pre-defined locations. In the following, we introduce three mutation strategies as examples.

We investigated features that were specific for Solidity. For example, one mutation introduces modifiers for functions, which define conditions that must hold when a function is executed. Listing 1.1 shows a smart contract with modifiers written in the Solidity language. Unlike traditional programs, smart contracts cannot be modified once they are deployed on the blockchain. As a result, their correctness is crucial. So is the correctness of the compiler since the compiled programs are deployed on the blockchain. Furthermore, the Solidity compiler

```
1 contract AccessRestriction {
2   address public owner = msg.sender;
3   //default modifier:
4   modifier onlyBy(address account){
5     require(msg.sender == account, "Sender not authorized");
6     _; } //injected modifier:
7   modifier cgskst(address value){
8     require(value == address(0x0), "");
9     _; } //injected modifier:
10  modifier cbhsmo(address value){
11    require(value == address(0x0), "");
12    _; } //injected modifier:
13  modifier nlwxmv(address value){
14    require(value == address(0x0), "");
15    _;
16  }//Make newOwner the contract owner:
17  function changeOwner(address newOwner) public onlyBy(owner) cgskst(address(0x0)) cbhsmo(address(0x0)) nlwxmv(address(0x0)){
18    owner = newOwner;
19 }}
```
Listing 1.1: Simple modifier example

```
bibt4QkDIfJ: {
  bsJxhbtSJBu: {
    bHhq23OwDjZ: { try {
        bEdqZ33tKi9: {
          bVm9tCxbul4: {
            if (i >= 5){ break; }
            break bEdqZ33tKi9;
          } }
      } catch (RuntimeException e){
        bQ2yucCPLQr: {
          System.out.print("X");
          break bQ2yucCPLQr;
}}}}}
```
Listing 1.2: Labelled block mutation

```
contract Test {
  function testFunc(int a)
    public pure returns (int) {
    int result = a + a++;
    //produces 3 when a is 1
    return result;
} }
```
Listing 1.3

has been under rapid development and there are unique language features with sometimes confusing semantics. Thus, it is a good target for evaluating the effectiveness of SpecTest. In this example, the modifier onlyBy ensures that the function changeOwner can only be called when the address of the contract owner is used. By integrating various dummy modifiers (Lines 7, 10 & 13) into our seed contracts and by adding them to functions (Line 17), we noticed that an older version of the Solidity compiler crashed in some cases, when more than a certain number of modifiers are used. Such a case is difficult to find with normal tests, since it is rare to use multiple modifiers for a function. Given that a less-fired SOS rule is concerned with the modifier construct in Solidity, to introduce modifiers, the mutator scans through the AST for function declarations. For each function declaration, the mutator randomly adds one or more modifiers.
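
The modifier mutation can be approximated with a small sketch. The mutator described above works on an AST and chooses modifiers randomly; the code below instead uses a regular expression on the source text and a caller-supplied modifier name, purely to keep the example self-contained. `ModifierInjector` and `inject` are hypothetical names, and the pattern only handles simple `public` function headers.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical, text-based stand-in for an AST-based modifier mutation.
class ModifierInjector {
    // Matches simple headers such as: function changeOwner(address newOwner) public
    private static final Pattern PUBLIC_FUNCTION =
            Pattern.compile("function\\s+\\w+\\s*\\([^)]*\\)\\s*public");

    // Declares a dummy modifier before the first public function and
    // appends an invocation of it to every public function header.
    static String inject(String source, String modifierName) {
        Matcher first = PUBLIC_FUNCTION.matcher(source);
        if (!first.find()) {
            return source; // nothing to mutate
        }
        String definition = "modifier " + modifierName
                + "(address value){ require(value == address(0x0), \"\"); _; }\n";
        String withDefinition = source.substring(0, first.start())
                + definition + source.substring(first.start());
        // $0 is the matched header; the modifier call is appended after "public".
        return PUBLIC_FUNCTION.matcher(withDefinition)
                .replaceAll("$0 " + modifierName + "(address(0x0))");
    }

    public static void main(String[] args) {
        String seed = "contract AccessRestriction {\n"
                + "  address public owner = msg.sender;\n"
                + "  function changeOwner(address newOwner) public {\n"
                + "    owner = newOwner;\n"
                + "  }\n"
                + "}\n";
        System.out.println(inject(seed, "cgskst"));
    }
}
```

Calling `inject` repeatedly with different names yields functions with several modifiers, similar in spirit to Line 17 of Listing 1.1.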

We also introduced specific mutations for Java. For example, our experiments showed that semantic rules associated with labels were not fired. Hence, we introduced mutations that target these rules, e.g., a mutation that injects labelled blocks, which is a special and rarely used feature that allows an immediate exit of a block with a break statement. This mutation is illustrated in Listing 1.2, where we injected labelled blocks and breaks (with these labels) into a seed program.
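
For readers unfamiliar with the feature, the following self-contained Java program shows how a labelled block is left with a labelled `break`. The class and label names are chosen for illustration and are unrelated to the generated labels in Listing 1.2.

```java
// Demonstrates labelled blocks: "break label" exits the block carrying that label.
class LabelledBlockDemo {
    static String run(int i) {
        StringBuilder out = new StringBuilder();
        outer: {
            inner: {
                out.append("A");
                if (i >= 5) {
                    break outer; // leaves both blocks immediately
                }
                out.append("B");
                break inner;     // leaves only the inner block
            }
            out.append("C");     // skipped when "break outer" was taken
        }
        out.append("D");
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(run(1)); // ABCD
        System.out.println(run(7)); // AD
    }
}
```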

For both Solidity and Java, we noticed that there are various rules in the K specifications (i.e., 11 rules for Java and 17 for Solidity) concerning mathematical expressions that were not covered, e.g., computations with hex-values. In order to cover these rules and to cover unusual usages in different contexts, we relied on a random approach, in contrast to the other mutations where we injected code at specific places. We developed mutations that produce a variety of mathematical expressions combining various language features, like operations containing variables with different data types, hexadecimal, octal or binary literals, pre- and postfix increment/decrement (++/--), bitwise and bitshift operators, various combinations of unary operators and arrays. A simplified example of a mutation produced with this strategy is shown in Listing 1.3. It can be seen that the increment operator (++) is used in an unusual context within a mathematical expression. Our experiments showed that the computation produced unexpected results, i.e., we found an issue with the computation order that caused the increment to be executed first, although it should be executed last [19].

#### 2.4 The Fuzzer

By injecting specific language features into the seed programs, the mutator increases the likelihood of firing uncovered or poorly covered SOS rules during the test execution. The fuzzer is a fuzzing engine which generates test inputs for a given program. The generation is based on optimization (e.g., using genetic algorithms). One of the required inputs for the fuzzer is a set of seed source programs. Such source programs are often abundant. For instance, there are thousands of Solidity programs (contracts) on EtherScan.io. The fuzzer takes these contracts as input and generates test inputs for each contract. During this process, the fuzzer sets up a test blockchain network, deploys the contracts, and generates a sequence of transactions which invoke functions.

For Solidity, we applied an existing smart contract fuzzer called sFuzz [46] that works with a new adaptive fuzzing technique for maximising the branch coverage. sFuzz uses an optimised version of a technique called American Fuzzy Lop (AFL) [59], for producing inputs that can achieve a high branch coverage. It includes various test oracles for the detection of general vulnerabilities, like Integer overflows, or smart contract specific vulnerabilities, like a gasless send [48]. We applied sFuzz to maximise the coverage of our test programs to cover our injected features. For our injected features, the coverage was usually easily achieved. However, for other cases or to minimise the test inputs, it might be necessary to customise the fuzzer to specifically target newly added language features. For example, during the mutation, we can record which parts of the contracts have been mutated and prioritise those parts during fuzzing. For Java, we did not apply a fuzzer, because the majority of our seed programs were simple in nature. A single run produced full coverage in almost all cases.

#### 3 Evaluation

We have implemented SpecTest for two compilers, a compiler for a general purpose language (Java) and one for a new domain-specific language (Solidity). In the following, we design multiple experiments to systematically answer the following research questions (RQ).


#### 3.1 Test Setting

As seed programs, we used existing test cases of K-Java [24] and KSolidity [34,44]. KSolidity is still under development, which means that we could not test all features or a large set of contracts, but it was already sufficiently developed to support many interesting cases. K-Java supports most features of Java 8, but it also has limitations, i.e., it was implemented in an old version of the K framework, which did not focus on performance. Hence, we used seed programs without imports of libraries. We do not regard this as a limitation since small programs have advantages, e.g., they are easier to debug and they reduce the time for test case minimisation. Moreover, it is well-known that many bugs can be revealed by small test cases [32], which are also common in traditional testing.

For Solidity, we had only 37 seed programs, which were part of the KSolidity project, due to its early stage. Hence, it makes sense to apply SpecTest since it enables the generation of more test programs in a systematic way. Our mutator for Solidity is written in about 5,300 lines of Java code. In each test run, we applied one of our mutations (or in some cases also combinations) to the seed programs. We applied sFuzz to the mutated contracts and then converted the resulting test cases into a form usable by KSolidity. We primarily tested the Solidity compiler version 0.5.13, but initially also older versions. In some cases, we had to apply Truffle tests [21] (v5.1.10), and for debugging we used Remix [18], which facilitates a step-by-step exploration of the contract bytecode.

For Java, we applied 756 seed programs and our mutator has about 6,100 lines of code. The mutations were similar to those explained before. In contrast to Solidity, we did not need a sophisticated fuzzer since the mutated Java programs were covered easily. Our focus was Java 13 (openjdk 13, 2019-09-17, RE build 13+33-Ubuntu-1), but we also tested older versions (11 and 8). For the mutator, we applied JavaParser 3.14.3 for parsing the programs and for injecting mutations.

The experiments for Solidity were performed on a Dell X1 Carbon with an Intel i7-8565U CPU with four 1.80GHz cores and 16 GB RAM, for Java on a PC with an Intel i7-7700 CPU with four 3.60GHz cores and 64 GB RAM.

#### 3.2 Experiment Result

We ran more than 30,000 test cases for Java, which had a total execution time of about three weeks. For Solidity, we ran more than 50,000 test cases with a total execution time of about two weeks. Details about the distribution of the run time will follow below. The execution times are not exact numbers, since the experiments sometimes got stuck due to out-of-memory exceptions, insufficient disk space, etc. Unfortunately, we could not fully resolve such issues, because many mutations inject features with random aspects into the diverse seed programs. This caused various unpredictable situations, like endless loops or overly large data structures. By adapting our mutator, we greatly reduced the number of such situations, but we could not remove all rare cases.

*RQ1: How effective is our proposed method in finding bugs or inconsistencies?* We discovered issues and bugs both for Solidity and Java. Some of these issues were not found within the compiler or the runtime environment, but within the language semantics. Fixing such issues is also essential, since improving the specification is an important aspect of testing.

In total, we found six issues for the Solidity compiler [19,10]: two were related to error/warning messages [7,13], and three of the other issues might have the same cause, i.e., the execution order. For KSolidity, we found eight issues, six of which were related to unimplemented features. For Java, we found four issues with the compiler [2,5], two of which were concerned with error messages [6,12], and we discovered 13 issues with K-Java [14,15,11,9,8,3,1,16] (eight issues or bugs, one warning-related issue, and four minor issues, like a wrong output representation [16]). More details about the different types of issues follow below.

Our experiments showed that SpecTest is able to reveal issues, inconsistencies and bugs. These issues were not only found in the compiler, but also in the language semantics (which are developed independently by other groups with dedicated effort). One might argue that finding bugs/issues in the language semantics is not as meaningful as finding bugs in the compiler. We believe that it is also crucial to ensure the robustness of the semantics, since in general the quality of the tests or specification is essential for the overall robustness of software. SpecTest was able to find various inconsistencies and bugs in the specifications, which is important for the specification developers, as well as issues in the compilers. We have spent effort on confirming our findings, and out of the 31 issues, we submitted 19 to the corresponding git repositories and reported the other issues to the developers or to a bug reporting system. For 13 issues, we received a confirmation or the developers mentioned that they will investigate and fix them.

One aspect that might have limited the effectiveness is that we did not fully apply our method to Java, since we only tested simple seed programs and did not use fuzzing. We believe that the issues we found still show that our method was reasonably effective, even though we only partially applied it. Using the full extent of SpecTest for Java might require a more powerful specification, which is a potential topic for future work. Moreover, it should be mentioned that KSolidity is still being developed and is not as stable as the Solidity compiler (or runtime environment), since much more effort was invested into the development of the latter. The situation is similar for K-Java, and Java in general is robust due to its maturity.

*RQ2: What kind of bugs and inconsistencies can be found?* We categorise our findings into three categories as illustrated in Table 1, i.e., (1) normal issues, bugs and missing features, (2) issues related to warning or error messages, and (3) minor inconsistencies or issues, like a small discrepancy in the output, e.g.,



Table 1: Found semantics and compiler issues

The most interesting issues that we found were the ones concerning the wrong computation order in Solidity. The cause of these issues was actual semantic errors within the compiler. Moreover, we also found various issues with error or warning messages. Such issues might seem trivial, but it is important to fix them since meaningless error messages can cause a huge waste of debugging effort. The bugs we found in the specifications had multiple sources, like the syntax parser, wrong semantic rules, partially implemented rules, or rules applied in a wrong context. Although K-Java and KSolidity already had many manual tests, we showed that SpecTest was able to discover many inconsistencies and bugs. In the following, we present example issues from the mentioned categories.

*Solidity Findings.* One of the issues [19] that SpecTest identified was that wrong results were produced when we tested expressions with different assignment operators. The behaviour can be observed in the following example, where the increment operator is applied first, but should be applied last.

```
int a = 2; a *= 1 + a++; //results in 9 but should be 6
```
A potential cause might be a wrong computation order. This issue was found since some SOS rules for assignment operators were uncovered. By creating mutations that target these rules, we could generate expressions like in the example which led to the discovery of the issue since the oracle predicted a different result.
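
The two results can be reproduced with a small Java sketch. Java is used here only because its language specification fixes the evaluation order (the left-hand operand of a compound assignment is saved before the right-hand side is evaluated); the method `incrementFirst` is a hand-written emulation of one order that yields 9, and whether the Solidity compiler performs exactly these steps is an assumption.

```java
// Contrasts the expected result (6) with an "increment first" order (9).
class EvaluationOrderDemo {
    // Java semantics: the old value of a (2) is saved, then 1 + a++ evaluates to 3.
    static int javaOrder() {
        int a = 2;
        a *= 1 + a++;
        return a; // 2 * 3 = 6
    }

    // Emulation (assumption): the side effect of a++ is applied before a is read on the left.
    static int incrementFirst() {
        int a = 2;
        int valueOfPostfix = a; // a++ still yields the old value 2
        a = a + 1;              // but a is already 3 when the left operand is read
        a = a * (1 + valueOfPostfix);
        return a; // 3 * 3 = 9
    }

    public static void main(String[] args) {
        System.out.println(javaOrder());      // 6
        System.out.println(incrementFirst()); // 9
    }
}
```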

An inconsistency regarding an error message [13] was revealed when we tested computations with different data types. As illustrated below, we discovered that it is possible to add int variables with different bit sizes, but an error is produced if an int\_const is added to an int variable with a smaller bit size.

```
int8 a = 10; int16 b = 234;
int c = b + a; // works
int c = 234 + a;//TypeError: Oper. + incompatible with types int_const & int8
```
In this case, our oracle performed the computation without an error, but the Solidity compiler produced a type error. For KSolidity, we found an incorrect overflow behaviour for computations, and that there is no support for numerous language features, like increment operators.

Additionally, we applied our Solidity Truffle tests to the Conflux blockchain [17], which is a new alternative to Ethereum. It can basically be seen as another runtime environment for Solidity contracts. With our tests, we were able to reveal a bug in the testing environment that led to incorrect results when we injected formulas with unary and bitwise operators [4].

*Java findings.* Our experiments showed that there is an inconsistency [1,2] when casts from double and long variables to Integers are performed. These casts are handled differently by Java when an overflow occurs, i.e., in the following code the result will be the maximum Integer for the double cast, and bits will be cut off for the long cast. In K-Java both casts produce the same result, i.e., bits will be cut off. Although this behaviour is documented in the language specification and others have already wondered about this issue, we believe that the approach of K-Java is more consistent, and we are still waiting for a comment from the Java team about the motivation to handle these cases differently.

```
System.out.println(((int) 2147483648L));  // -2147483648
System.out.println(((int) 2147483648.0)); // 2147483647
```

A problem we found for the Java compiler [6] is a missing error message when a computation with a long and a double variable is performed. Normally, an incompatible types error is produced as illustrated in the following code, but the error does not occur when the same computation is done with an += operator.

```
long a = 1L + 0.1 * 3L; // produces error: incompatible types: possible lossy
                        // conversion from double to long
long b = 1L;
b += 0.1 * 3L;          // no error is produced
```
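
The difference can be traced to how the Java Language Specification (§15.26.2) defines compound assignment: `E1 op= E2` is equivalent to `E1 = (T)((E1) op (E2))`, where `T` is the type of `E1`, so the narrowing cast is implicit. The following sketch, with hypothetical class and method names, shows that both forms give the same value.

```java
// Compound assignment on a long with a double operand includes an implicit cast.
class CompoundAssignmentDemo {
    static long compound() {
        long b = 1L;
        b += 0.1 * 3L; // compiles: implicit narrowing from double to long
        return b;
    }

    static long explicitCast() {
        long b = 1L;
        b = (long) (b + 0.1 * 3L); // the expansion given by the language specification
        return b;
    }

    public static void main(String[] args) {
        // Both print 1: the sum is roughly 1.3 and is truncated towards zero.
        System.out.println(compound());
        System.out.println(explicitCast());
    }
}
```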

We discovered that K-Java has an issue with the modulo operator [14]. The computation is wrong for all negative doubles and floats, i.e., it produces inconsistent values compared to Java and compared to the same computation with Integer values. This is illustrated in the following examples.

```
System.out.println("-8 % 3 = "+(-8 % 3)); //K-Java and Java return -2
System.out.println(" -8.0 % 3.0= " +(-8.0 % 3)); //K-Java 1.0 Java -2.0
System.out.println(" 8 % -3 = "+( 8 % -3)); //K-Java and Java return 2
System.out.println(" 8.0 % -3.0= "+( 8.0 % -3.0));//K-Java -4.0 Java 2.0
```
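
Java defines `%` on floating-point operands as a truncating remainder whose sign follows the dividend. As a reference point for the values above, the sketch below recomputes the remainder by hand; `truncatedRemainder` is a hypothetical helper, and the hand-written formula is only meant for small values like these, where no rounding error occurs.

```java
// Recomputes Java's floating-point remainder for small operands: a - b * trunc(a / b).
class RemainderDemo {
    static double truncatedRemainder(double a, double b) {
        double quotient = a / b;
        // Truncate towards zero: ceil for negative quotients, floor otherwise.
        double truncated = quotient < 0 ? Math.ceil(quotient) : Math.floor(quotient);
        return a - b * truncated;
    }

    public static void main(String[] args) {
        System.out.println(-8.0 % 3.0);                     // -2.0
        System.out.println(truncatedRemainder(-8.0, 3.0));  // -2.0
        System.out.println(8.0 % -3.0);                     // 2.0
        System.out.println(truncatedRemainder(8.0, -3.0));  // 2.0
    }
}
```
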
In general, we found most issues when we injected mathematical expressions into the seed programs. This was an interesting finding for us, since these expressions are a major component of all programming languages, and we assumed it would be straightforward to develop a specification for them. However, it turned out that many interesting and ambiguous situations can occur when various combinations of operators, variables and literals are tested.

*RQ3: Can SpecTest effectively improve semantic coverage?* The objective of SpecTest is to systematically generate a test suite for achieving better semantic coverage. In order to evaluate the coverage, we conducted the following experiments. We identified the semantic rules that were least covered by the existing tests for Solidity and Java, and then applied SpecTest systematically (with specific mutators) and measured the improvement in terms of semantic coverage.

First, we evaluated the semantic coverage that is achievable with the original seed programs of K-Java and KSolidity to have a reference value for the comparison with the mutated test programs. Table 2 shows a comparison of the coverage of the original K-Java test cases with that of our mutated test cases. The rule coverage measurement in the early version of the K framework used by K-Java is rudimentary. Hence, we could only measure the covered lines and characters of the rule files, and many of these files were already fully covered due to redundant or unreachable rules. Nevertheless, we were able to identify various uncovered rules in four of the files, and we produced mutations that covered these rules.

KSolidity was built with a new version of the K framework, which has a better measurement of the rule coverage. Since the development of KSolidity is still ongoing, we focused on the completed features, like conditions, loops, arrays, structs, simple transactions, or mathematical expressions, and managed to increase the coverage. Even with just these features, we found meaningful bugs. The coverage improvements compared to the original seed programs are illustrated in Table 3. There were partially implemented features which could not be fully covered. The coverage of the completed features was considerably improved.

Table 2: Comparison of the covered rules between the K-Java tests (Default) and our mutated test cases


We have shown that our mutations can increase the rule coverage both for K-Java and KSolidity. Our close investigation shows that the increase in coverage requires non-trivial programs (e.g., programs that specifically include missing language features) which are unlikely to be generated without our mutator. It is worth mentioning that writing mutations for the uncovered rules led to the discovery of many issues. Moreover, the mutations that targeted specific semantic rules or language features could generally increase the coverage instantaneously with a single test, but we still applied them to all seed programs, and we also used general mutation operators to produce mutants for many different situations.

*RQ4: How much effort is it to apply SpecTest?* To answer this question, we analysed the effort required to apply and implement SpecTest for Java and Solidity. It consists of two parts, the effort of applying SpecTest once it is developed, and the implementation effort. The latter one consists of three parts, the effort for developing the oracle, the mutator and the fuzzer. The goal of this analysis is to understand how generalisable SpecTest is to a new programming language.

Applying SpecTest after the implementation has the following timing requirements. Both for Solidity and Java, the mutant generation took only a few seconds. For Solidity, we set a timeout of 2 min per contract for fuzzing and it took on average 24 min to finish all 37 contracts. Usually, 40–45 test cases were created by the fuzzer (normally multiple per contract depending on the mutation). Most test cases were executed by KSolidity within a minute, but there were outliers which did not terminate even after hours. Hence, we used a timeout of 5 min. On average, the testing time of KSolidity was 37 min (when five runs with different mutations were considered). For Java, we did not apply a fuzzer due to the simplicity of the seed programs. We executed the 756 test programs directly with K-Java, which took on average 3 hours and 51 min for an introduced mutation (for five runs with different mutation types).
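
A per-test timeout of this kind can be sketched with the standard `java.util.concurrent` API. The example is an in-process stand-in: in the actual setting the timed task would be an external run of the semantics, and the class `TimedRun` and its fallback value are hypothetical.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: runs a task and gives up after a fixed time budget.
class TimedRun {
    // Returns the task's result, or onTimeout if the task exceeds the budget.
    static <T> T runWithTimeout(Callable<T> task, long timeout, TimeUnit unit, T onTimeout)
            throws InterruptedException, ExecutionException {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Future<T> future = executor.submit(task);
            try {
                return future.get(timeout, unit);
            } catch (TimeoutException e) {
                future.cancel(true); // interrupt the worker thread
                return onTimeout;
            }
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        String fast = runWithTimeout(() -> "done", 1, TimeUnit.SECONDS, "timeout");
        String slow = runWithTimeout(() -> {
            Thread.sleep(2_000); // stands in for a non-terminating test case
            return "done";
        }, 100, TimeUnit.MILLISECONDS, "timeout");
        System.out.println(fast + " " + slow); // done timeout
    }
}
```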

We now discuss our development efforts and the time requirements of the implementation of SpecTest for a new language. In our case, the most effort went into the development of the mutator and the supporting tooling, like translators. The implementations for both Solidity and Java took about two to three months each. It should be noted that this time depends on the availability of existing tools, like a language parser or fuzzer. For this work, we relied on pre-existing language specifications, which helped to reduce the overall effort, but as mentioned they came with limitations, which caused additional efforts. Writing a specification for a new programming language is not trivial. Based on past experiences, we assume that it takes about six to 12 months depending on the complexity of the language. Given the many recent efforts on developing executable language semantics, we believe that SpecTest provides a good way to better utilise these existing specifications for systematic compiler testing.

To summarise, the implementation effort of SpecTest is about two to three work months mainly for the mutator, if there is an existing specification and a fuzzer. The application of our method in terms of run time is about a few hours for a single mutation. Further increasing the number of seed programs, and performing a reasonable number of mutations increases this time to a couple of days or weeks, when the tests are only executed on one machine. Even though this seems like a lot of effort, we believe our method is still worthwhile, since it will pay off eventually, especially considering all the effort that can be required for releasing a new compiler version, when serious bugs are discovered. Moreover, our method can be easily accelerated by distributing it to multiple machines.

As mentioned before, the implementation effort for our method was about two to three work months. This is about the time that is needed for the mutator and for other minor tools. It does not include the effort for creating the language specification or the fuzzer. There are already many existing fuzzers that could be adopted for new programming languages, and also numerous language specifications. We especially recommend our method for all languages with pre-existing specifications (or when similar specifications exist), since then there is only a small implementation effort, which will soon be offset by the advantages of SpecTest. Even when there are no pre-existing specifications for a language, we highly recommend creating one and adopting our method, since it will save time in the long term.

An effort that should not be underestimated is the time for analysing bugs. Finding the cause of a bug can be troublesome due to the complexity of the test cases; it sometimes took us hours or even days. In such cases, it can be helpful to minimise failing test cases. There are numerous techniques, like delta debugging [62] or program slicing [58], which can reduce the debugging effort, and integrating them into SpecTest would be interesting for future work.

#### 3.3 Threats to Validity

A threat to the validity of our evaluation might be that we did not show a comparison to other compiler testing methods. A comparison might be interesting, but our main goal was to show the general applicability and usefulness of SpecTest for different compilers. It would not be fair to compare SpecTest to other testing techniques that focus on different types of bugs, e.g., it might be much easier to find simple parsing errors caused by unusual characters (with techniques, like fuzzing).

One might argue that the test size we used is too limited, which might be a potential threat to the validity of our evaluation. It is true that it would make sense to apply more seed programs and to continue mutating and testing for an extended period of time. However, due to restrictions of KSolidity and K-Java, a larger set of seed programs was not supported, and due to a limited time and computing budget, we did not execute more tests. Nevertheless, we believe that our test size was reasonable, since it allowed us to reveal various issues and bugs.

Another threat to the validity of our evaluation might be that we relied only on existing specifications, whose quality we cannot be sure about. It is true that we might have more confidence in a specification that we created ourselves, but since SpecTest checks the correctness of compilers as well as specifications, we trust that the specifications we used had a reasonable quality.

#### 4 Related Work

Compiler testing is a broad research field with a range of techniques that target, e.g., the test case generation [49,31,23] or the oracle problem [22]. Several surveys give an overview of these methods [56,26,39,25]. Our study, however, shows that existing approaches suffer from two weaknesses. They do not apply a test case generation that can extensively cover rare language features, and they often rely on weak or limited test oracles. The test case generation often works with standard code coverage criteria concerning compiler components. For example, Zelenov and Zelenova [61] applied a BNF grammar as a model and produced test cases according to, e.g., code or functional coverage of a syntax analyser. A method based on the coverage of context-free grammar rules was presented by Purdom [49], but it only targets the parser of the compiler. Kalinov et al. [35,36] defined coverage criteria based on a state machine specification. In contrast to our work, they do not identify rare language features by analysing semantic rule coverage, and they do not construct their test programs via code mutation.

Various compiler testing methods work without any coverage by just randomly generating test cases according to a grammar, which defines valid programs [52,60]. There are also techniques that use mutation for producing test cases [38,41,55]. For example, Le, Sun, and Su [41] produced mutants that should have the same behaviour as the original programs in order to find cases where the behaviour diverges. However, in contrast to our work, they are not considering a semantic coverage for less used language features.

Several attempts have been presented to answer the oracle problem for compiler testing. In the simple case of positive/negative testing, an oracle only tells whether a program is compilable. When a test program is compiled, the result is checked to see if it matches the expectation of the oracle. A match means a successful compilation. Otherwise, there may be a bug. For example, Zelenov and Zelenova [61] illustrated a specification-based approach for generating positive and negative tests. Such approaches are limited to testing the syntax parser.

In the line of work on differential testing compilers [45], the oracle is defined as consistency among two or more compilers for the same language. In this method, the same test programs are given to multiple compilers and the results are compared. If there is a difference then a bug in one of the compilers or an ambiguity in the language is found. There exist different versions of differential testing as explained by McKeeman [45]. Cross-compiler testing [52] is a technique that works by contrasting a new compiler against a pre-existing compiler that has the same specification. When the same test programs are executed with both compilers, a different result can reveal a fault in the new or pre-existing compiler. Sometimes this technique is also called randomised differential testing [60], because the test programs are usually generated randomly, e.g., based on a grammar. Another differential testing technique is cross-optimisation testing, where programs compiled with different optimisations implemented for the same compiler are contrasted to find bugs. Le, Sun, and Su [42] presented such a technique for stress testing link optimisers. Their method generates random test programs and injects various function calls into different code regions in order to increase dependencies between procedures, and it also randomly selects different optimisation levels to produce challenging tests for the optimiser. Cross-version or regression testing is another differential testing method that tries to find bugs by comparing different versions of the same compiler. For example, Sun, Le, and Su [54] developed Epiphron, a tool that generates random test programs to find inconsistencies with the debug information, like missing warning messages, in different versions of the same compiler. Such approaches work only if there are multiple relatively mature compilers for the same language. 
In contrast to these techniques, SpecTest works with a formal language specification which is especially useful when no compilers could be used as a reference. Moreover, different compilers or compiler versions for the same language might still suffer from the same bugs, which is unlikely for an independent specification.

There are approaches that assume the existence of a reference compiler, i.e., the oracle is an existing formally proven compiler. For example, Leroy [43] presented CompCert, a compiler for a subset of C, which was verified with the proof assistant Coq. However, there are usually no such compilers for a newly developed language and the existing ones cover only subsets of languages since formally proving a compiler is extremely challenging.

For metamorphic testing [57], the oracle is defined as certain algebraic properties of the compiler. For instance, one such property explored in the compiler testing technique called equivalence modulo inputs (EMI) [40,55] is that a modification on a program part which is never executed should not alter the result. Based on this simple oracle, EMI works by randomly pruning dead code (i.e., code which is not executed given a certain program input) or by randomly inserting or removing instructions from dead code based on a Markov Chain Monte Carlo method. Such approaches are limited to identifying bugs which violate the algebraic properties. Hence, they are not able to find deep semantic errors.

The closest related work to SpecTest was proposed by Kalinov et al. [35,36], where a language specification in the form of abstract state machines and montages is used as an oracle. With this specification, they compare the expected output from the specification to that of a compiled program in order to check whether there are compiler bugs. This approach is limited by the choice of the specification language and it quickly becomes infeasible, because the computation time is too high. Moreover, it is not concerned with semantic coverage.

To demonstrate the limitations of the closely related methods, we return to the example of Sect. 2, where we discussed a bug with the increment operator that we discovered during our analysis of the Solidity compiler.

```
int a = 1; int result = a + a++; //produces 3, but it should be 2
```
In this example, the compiler had an issue with the computation order, which led to wrong results. Existing approaches, like EMI or differential testing, might be able to detect such issues, but with EMI it is difficult to find mutations that lead to such cases. The same is true for differential testing, and there is also a high chance that different compiler versions have the same faulty behaviour for such a case (e.g., all versions of the Solidity compiler had this issue).
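The effect of the computation order on this expression can be reproduced with a small model. The following Python sketch is only an illustration: the function name and the `left_operand_first` flag are hypothetical, and the assumption that the faulty result stems from evaluating the right operand first is ours; the text above only states that the computation order was at fault.

```python
def add_with_post_increment(a, left_operand_first=True):
    """Model the expression `a + a++` under two evaluation orders.

    Returns (value of the expression, value of a afterwards).
    """
    if left_operand_first:
        left = a        # read a for the left operand
        right = a       # a++ yields the old value of a ...
        a += 1          # ... and then increments a
    else:
        right = a       # a++ is evaluated first and yields the old value
        a += 1          # the increment takes effect immediately
        left = a        # the left operand now reads the incremented a
    return left + right, a


expected, _ = add_with_post_increment(1, left_operand_first=True)
faulty, _ = add_with_post_increment(1, left_operand_first=False)
print(expected, faulty)  # 2 3
```

Under this model, the two orders reproduce the expected value 2 and the reported value 3, which shows why an oracle that fixes the evaluation order, such as an executable specification, can separate the two behaviours.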

### 5 Conclusion

We have demonstrated our novel compiler testing technique called SpecTest that targets less-used language features. SpecTest is based on three components: an executable language specification, a fuzzer for generating test inputs, and a mutator which generates new programs by injecting rare language features. Comparing the abstract execution of the specification to the concrete execution of a compiled program enables our method to find deep semantic errors as well as inconsistencies and issues in the specification.

We evaluated SpecTest by applying it to two programming languages: Java and Solidity. The results are encouraging. We discovered various issues concerning the compilers and the language specifications. Some of them helped to improve the quality of the compilers and many will enhance the specifications.

In the future, we plan to further explore the generality of SpecTest for other languages, and we intend to consider different types of executable specifications.

#### Acknowledgments

This research is supported by the National Research Foundation Singapore under its NSoE Programme (Award Number: NSOE-TSS2019-03).

# References


on the effectiveness of lightweight mechanization. In: Proceedings of the 39th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL 2012, Philadelphia, Pennsylvania, USA, January 22-28, 2012. pp. 285–296. ACM (2012). https://doi.org/10.1145/2103656.2103691


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **PASTA: An Efficient Proactive Adaptation Approach Based on Statistical Model Checking for Self-Adaptive Systems**

Yong-Jun Shin, Eunho Cho, and Doo-Hwan Bae

Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Republic of Korea {yjshin, ehcho, bae}@se.kaist.ac.kr

**Abstract.** Proactive adaptation, in which the adaptation for a system's reliable goal achievement is performed by predicting changes in the environment, is considered an effective alternative to reactive adaptation, in which adaptation is performed after observing changes. Because the prediction of environmental changes may be uncertain, it is necessary to verify and confirm an adaptation's consequences before execution. To resolve the uncertainty, probabilistic model checking (PMC) has been utilized for verification of adaptation tactics' effects on the goal of a self-adaptive system (SAS). However, PMC-based approaches are limited by the state-explosion problem in the verification of complex SAS models and by the modeling languages supported by the model checkers. In this paper, to overcome the limitations of the PMC-based approaches, we propose an efficient Proactive Adaptation approach based on STAtistical model checking (PASTA). Our approach allows SASs to mitigate the uncertainty of the future environment, faster than the PMC-based approach, by producing statistically sufficient samples for verification of adaptation tactics based on statistical model checking (SMC) algorithms. We provide algorithmic processes, a reference architecture, and an open-source implementation skeleton of PASTA for engineers to apply it to SAS development. We evaluate PASTA on two SASs using actual data and show that PASTA is efficient compared to the PMC-based approach. We also provide a comparative analysis of the advantages and disadvantages of PMC- and SMC-based proactive adaptation to guide engineers' decision-making for SAS development.

**Keywords:** Self-adaptive system · Proactive adaptation · Statistical model checking · Environmental uncertainty

# **1 Introduction**

As the complexity of an environment that affects a system's goal achievement increases, analyzing the environment becomes important for reliable goal achievement. The environment, such as user traffic and outdoor temperatures, can change over time [15,29]. Full anticipation of environmental changes at the system design time is challenging and often impossible [6,9]. Systems are required to be self-adaptive so that they change their behaviors and structures according to the environmental changes at runtime. To realize this, numerous design approaches [11,13,14,16] have been proposed based on the MAPE feedback loop [18]. These adaptation processes involve the continual monitoring and analysis of the environment as well as the planning and execution of the adaptation.

For most existing approaches, adaptation has been reactively triggered by system failures or changes in the environment [12,31,33]. Other adaptation approaches, known as proactive or predictive adaptation, have emerged, which have proven to be more effective than reactive adaptations in a changing environment by predicting changes in advance [2,24,26]; however, the prediction of environmental changes is uncertain, so the uncertainty affects the consequences of proactive adaptation. To resolve the uncertainty, probabilistic model checking (PMC) was utilized in some studies for the verification of adaptation tactics and their effects on the system's adaptation goal [5,26,27,28].

PMC-based approaches are a major method used for proactive adaptation; however, PMC may not be appropriate for the verification of large and complex self-adaptive system (SAS) models due to the state explosion problem. PMC requires a high verification cost in time and memory to fully examine the given probabilistic models, so the verification of complex SAS models and adaptation tactics may fail due to time and memory constraints. In addition, modeling languages supported by probabilistic model checkers must be used for the modeling of the SAS and the environment. Engineers must be familiar with modeling formalisms, such as Markov chains, Markov decision processes, or automata, that model checkers can interpret [21]. To overcome these limitations, we propose an efficient proactive adaptation approach based on statistical model checking (SMC) that consumes fewer verification resources than PMC and only requires simulation results of system models, without restricting the modeling language.

Our Proactive Adaptation approach based on STAtistical model checking (PASTA) offers the following contributions:


The remainder of this paper is organized as follows. Section 2 introduces related work on proactive adaptation. Section 3 provides the background knowledge of SMC. Section 4 presents an illustrative example. Section 5 introduces our PASTA approach. Section 6 evaluates PASTA based on two SASs with actual data. Section 7 discusses the threats to the validity of our work. Section 8 concludes the paper.

#### **2 Related Work: Proactive Adaptation**

Numerous studies on proactive or predictive adaptation have been conducted to address issues related to changing environments [3,20,24,25]. Compared with reacting to changes in the environment or system, predicting and responding to the predicted situations can be more difficult but more effective in preventing system failures and meeting requirements. Many case studies on proactive adaptation have been conducted, and it has been demonstrated that proactive adaptation outperforms reactive adaptation in terms of the system's adaptation goal [2,10,20]. For proactive adaptation, the prediction of the future environment is uncertain, so approaches utilizing probabilistic model checking (PMC), which verifies the property satisfaction of a probabilistic model, have been proposed to provide verified and trustworthy proactive adaptation results [5,26,27,28]. The main process of PMC-based proactive adaptation is illustrated in Fig. 1. The core of the process consists of the formal modeling of the future environment, system, and adaptation tactics, and the verification of the models to identify an optimal adaptation tactic for adaptation goal achievement. However, PMC is not appropriate for the verification of large and complex models due to its state explosion problem. It requires exhaustively examining all possible states of SAS models to verify adaptation tactics. It also requires engineers to develop SAS models written in modeling languages that model checkers can support. To tackle these limitations, in this paper we propose a statistical model checking (SMC)-based proactive adaptation approach [19,23,34] as an alternative to the PMC-based approaches that have been the major trend in proactive adaptation.

**Fig. 1.** PMC-based proactive adaptation process

#### **3 Background: Statistical Model Checking (SMC)**

We have utilized statistical model checking (SMC) to verify adaptation tactics at runtime under an uncertain environment. SMC is an efficient technique for verifying a stochastic model [22,23]. Whereas PMC exhaustively examines the model, SMC simulates the model to obtain samples and provides statistical evidence of the satisfaction or violation of the given property by applying hypothesis testing to the samples. SMC requires only a set of simulation results, so it can be applied to an executable black-box model or even to a set of simulation results alone. As with PMC, the verification results depend on the quality of the model. However, as it is a simulation-based approach, it is known to be an efficient alternative to PMC in terms of time and memory, performing verification with a certain confidence [1,19]. In this regard, SMC can be used effectively for the runtime verification of SAS adaptation tactics with uncertain environments. The following examples of SMC algorithms are widely used:


For the PASTA approach, an SMC algorithm is selected and used to obtain statistical evidence of an adaptation tactic's performance in a future environment to evaluate possible tactics and to identify the optimal tactic at runtime.
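One hypothesis-testing scheme often used in SMC is Wald's sequential probability ratio test (SPRT). The Python sketch below illustrates how such a test decides, from a stream of simulation outcomes, whether a property holds with probability at least `theta`. It is an illustration under our own assumptions rather than the algorithm used by PASTA: the function name, the indifference region `delta`, the error bounds `alpha` and `beta`, and the `simulate` callback are all hypothetical.

```python
import math


def sprt(simulate, theta, delta=0.05, alpha=0.01, beta=0.01, max_samples=100_000):
    """Sequential probability ratio test for P(property holds) >= theta.

    simulate() runs one simulation and returns True if the property held.
    H0: p >= theta + delta is tested against H1: p <= theta - delta.
    Returns (decision, samples_used); decision is True if H0 is accepted,
    False if H1 is accepted, and None if max_samples was reached first.
    """
    p0, p1 = theta + delta, theta - delta
    if not 0.0 < p1 < p0 < 1.0:
        raise ValueError("theta +/- delta must lie strictly between 0 and 1")
    accept_h1 = math.log((1 - beta) / alpha)   # upper threshold of the log ratio
    accept_h0 = math.log(beta / (1 - alpha))   # lower threshold of the log ratio
    log_ratio = 0.0                            # log of L(H1) / L(H0)
    for n in range(1, max_samples + 1):
        if simulate():
            log_ratio += math.log(p1 / p0)
        else:
            log_ratio += math.log((1 - p1) / (1 - p0))
        if log_ratio >= accept_h1:
            return False, n
        if log_ratio <= accept_h0:
            return True, n
    return None, max_samples
```

The number of samples is not fixed in advance: the test stops as soon as the accumulated evidence crosses one of the two thresholds, which is what makes sequential tests attractive when each simulation is costly.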

#### **4 Illustrative Example**

We illustrate PASTA using an adaptive air condition control system as an example. The system monitors indoor and outdoor air conditions, including temperature and humidity, and adaptively controls the indoor condition for a given target condition. Planning an adaptive air condition control with an immediate reaction to the monitored indoor condition can aid the system in achieving its goal; however, the indoor air conditions may change over time due to the influence of the outdoor air conditions, as shown in Fig. 2. If the adaptation plan is made without taking the environmental change into account, the adaptation consequences may differ from the expectations, and thus there could have been a better adaptation tactic that was not chosen. The air condition control system developed by the PASTA approach forecasts future air condition changes and selects an optimal adaptation tactic whose adaptation consequences are verified by SMC at runtime. Throughout this paper, we will describe our approach using this example.

**Fig. 2.** Adaptive air condition control system

# **5 Proactive Adaptation Based on Statistical Model Checking**

#### **5.1 PASTA overview**

We propose the PASTA approach, a proactive adaptation approach using SMC. Fig. 3 presents the overall adaptation process. The aim of the approach is to provide efficient proactive adaptation based on the prediction of environmental changes and the verification of the adaptation tactics of the SAS. (Step 1) PASTA continuously monitors the environment to capture its changes at runtime. (Step 2) It analyzes the monitored (historical) environment data and forecasts future environmental changes based on its forecasting algorithm. The prediction or expectation of the future environment is nondeterministic, for example a probability density function of future environmental conditions. (Step 3) Based on the prediction, a sample of the possible future environment is generated and given to the simulation engine as a simulation environment. (Step 4) In the given environment, an adaptation tactic is applied to the system model and simulated to produce a sample evaluation of the tactic's performance. The simulations are repeated until the system obtains a statistically sufficient number of samples for the verification of the tactic's performance with respect to the adaptation goal under the expected future environmental change. (Step 5) Based on the accumulated samples, the performance of an adaptation tactic is verified. All adaptation tactics are evaluated in the same manner, and the SAS statistically guarantees the effects of its adaptation tactics. (Steps 6 and 7) When all possible adaptation tactics have been evaluated, an optimal adaptation tactic is chosen and executed. This adaptation process is continuously repeated to respond to continuous environmental changes. We describe the PASTA approach in detail based on this adaptation process in the subsequent sections.

**Fig. 3.** Overall PASTA process

#### **5.2 Knowledge**

*Principle.* The PASTA approach requires an SAS to accumulate the monitored environment data. The accumulated historical environment data is analyzed to predict environmental changes. Furthermore, the system has its current system model that is an abstraction of the system behavior executable by a simulator. The model in PASTA is user-specific, and although the modeling language and system information to be modeled are selected by the engineer, the only requirement is that the model is executable to generate simulation logs. The system model also contains a finite set (space) of possible adaptation tactics that will be verified. An adaptation tactic is a specification of an adaptation that can be applied to the SAS and its model, such as a set of configurations. The adaptation goal is also specified in the knowledge. Thus, the optimal tactic for the adaptation goals will be selected and executed.

*Example.* The environmental factors of interest in the adaptive air condition control system are the indoor/outdoor temperature and humidity; therefore, the monitored environment data at a specific time include the values of four factors. The simulation models imitate the changes of the indoor temperature and humidity affected by outdoor conditions and the air condition control system's control values. The system's possible adaptation tactics are defined by its temperature and humidity control capabilities. For example, the system can increase or decrease the temperature and humidity in 0.1 °C and 0.1% increments up to 5 °C and 5%, respectively, in a discrete simulation time unit. The tactic space is a Cartesian product of the possible temperature and humidity controls. The adaptation goal is to manipulate the indoor temperature and humidity to the user's desired conditions.
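To give a sense of the size of this tactic space, the following Python sketch enumerates it. The names are hypothetical, and we assume that the "no change" control (0.0) is part of each range, which the description above does not state explicitly.

```python
from itertools import product

MAX_STEPS = 50  # 5.0 divided by the 0.1 step size, kept as an integer


def control_values(max_steps=MAX_STEPS):
    """All control values from -5.0 to +5.0 in steps of 0.1 (0.0 included)."""
    return [i / 10 for i in range(-max_steps, max_steps + 1)]


def tactic_space():
    """Cartesian product of temperature (°C) and humidity (%) controls."""
    values = control_values()
    return list(product(values, values))


tactics = tactic_space()
print(len(tactics))  # 10201 = 101 * 101
```

Under these assumptions each control has 101 possible values, so 10201 tactics have to be evaluated in every adaptation cycle, which is why the cost of verifying a single tactic matters.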

#### **5.3 Monitoring Environmental Changes**

*Principle.* (Step 1) The system constantly monitors the environment. The environment is measured as the values of the environmental conditions observable by the sensors. The current environmental data are added to the environment database. The current state of the system is also monitored, and the system model is kept up to date.

*Example.* The air condition control system constantly monitors the indoor/outdoor temperature and humidity. It accumulates the environment data in its environment database.

#### **5.4 Forecasting Future Environmental Change**

*Principle.* (Step 2) PASTA forecasts future environmental changes based on the accumulated historical environment data using data analysis or forecasting techniques. As the given historical environmental data consist of time-series data, time-series analysis and forecasting methods, such as random walk [30], error-trend-seasonal [17], autoregressive integrated moving average model [7], or any machine-learning technique, can be applied, and the choice of the forecasting method is left to domain engineers. What is important here is that the predictions of future environmental changes based on historical data are uncertain, so the results of the forecasting are non-deterministic expectations, such as the probability density function of future environmental conditions. This uncertainty will be resolved by SMC.

*Example.* The system predicts the outdoor temperature and humidity changes, which exhibit distinct repetitive patterns (seasonality) at 24-hour intervals. As the environmental data of this system exhibit distinct seasonality, they can be predicted naively with a random walk model using seasonal differencing [17]. Based on the historical temperature data and the forecasting algorithm, the temperature change from the present to a few hours later can be predicted using the probability density function. For example, if the current temperature at 2 p.m. is 24 °C, the temperature at 3 p.m. can be expected to change according to the uniform distribution between 24 °C and 30 °C.
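A forecast of this kind can be sketched in a few lines of Python. The sketch is one possible reading of a seasonal naive forecast, not the forecasting engine of PASTA: it assumes hourly readings, looks at how the value changed over the same hours on earlier days, and turns the smallest and largest observed change into the bounds of a uniform distribution. All function and parameter names are hypothetical.

```python
import random


def forecast_uniform_range(history, horizon=1, season=24):
    """Bounds of a uniform forecast for the value `horizon` steps ahead.

    history: readings at a fixed interval, oldest first; the last entry is
    the current value. The changes observed over the same steps in earlier
    seasons (e.g., previous days) are added to the current value.
    """
    if not 1 <= horizon <= season:
        raise ValueError("horizon must be between 1 and season")
    now = len(history) - 1
    changes = []
    t = now - season
    while t >= 0:
        changes.append(history[t + horizon] - history[t])
        t -= season
    if not changes:
        raise ValueError("need at least one full season of history")
    current = history[now]
    return current + min(changes), current + max(changes)


def sample_future(history, rng, horizon=1, season=24):
    """Draw one deterministic sample from the uniform forecast."""
    low, high = forecast_uniform_range(history, horizon, season)
    return rng.uniform(low, high)
```

For instance, if the temperature rose by 0 °C and by 6 °C between 2 p.m. and 3 p.m. on the two previous days and the current reading is 24 °C, the sketch yields the range from 24 °C to 30 °C used in the example; `sample_future(history, random.Random())` then draws one possible future value from it.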

#### **5.5 Planning Adaptation Using SMC**



*Principle.* The adaptation planning of the PASTA approach involves searching for the optimal tactic among possible adaptation tactics using SMC, as shown in Algorithm 1. Evaluating an adaptation tactic using SMC consists of three steps: sampling environmental changes, simulating adaptation tactics, and verifying the simulation results. (Step 3) The forecasting result is non-deterministic, so the sample generator produces a deterministic sample of possible future environmental conditions based on the forecasting result. SMC handles the uncertainty of the nondeterministic future environment by producing statistically sufficient samples, whereas PMC probabilistically verifies a stochastic model. The number of samples is determined by the SMC algorithm, as explained in the background section. (Step 4) The simulator takes the sample environment, the system model, and an adaptation tactic as inputs. It applies the given tactic to the system model, simulates the system in the sample of the future environment, and returns simulation result logs that represent the effects of the adaptation tactic in the future environment. (Step 5) The verifier receives the simulation results and evaluates the tactic's performance for the adaptation goal represented as a verification property. This process is performed for all adaptation tactics, and (Step 6) the optimal tactic is selected based on all evaluation (verification) results. Therefore, the planning time required for an adaptation depends on the number of tactics, the number of required samples, and the time for a single simulation of the model.

*Example.* Based on the predicted range of the temperature change at 3 p.m. (24 °C to 30 °C), the samples of the future outdoor temperature (for example, 25 °C, 27 °C, and 29 °C) are randomly selected by an SMC algorithm. The system model and an adaptation tactic (for example, lower the indoor temperature by 3 °C) under the current evaluation are simulated with the sample environments, respectively. Based on the simulation results, the verifier evaluates the adaptation results of the indoor temperature control. In this example, the average distance between the target condition and the current condition is used as a verification property representing an adaptation goal, but the maximum distance indicating the worst case, the presence or absence of events occurring with small probabilities, or any temporal logic can be used as verification properties [19,23,34]. When all possible temperature and humidity control tactics are verified (evaluated), the optimal one is selected.
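The planning steps described in this section can be summarised in a short Python sketch. It is a simplified illustration, not the paper's Algorithm 1: it uses a fixed number of samples per tactic instead of a stopping rule derived from an SMC algorithm, it uses the average distance to the target as the verification property, and the toy indoor temperature model and all names and numbers are assumptions made for the example.

```python
import random
from statistics import mean


def plan_adaptation(tactics, sample_environment, simulate, distance_to_goal,
                    n_samples=100):
    """Return (best tactic, its mean distance to the adaptation goal)."""
    best_tactic, best_score = None, float("inf")
    for tactic in tactics:
        distances = []
        for _ in range(n_samples):
            env = sample_environment()               # Step 3: sample environment
            result = simulate(tactic, env)           # Step 4: simulate the tactic
            distances.append(distance_to_goal(result))
        score = mean(distances)                      # Step 5: verify the property
        if score < best_score:
            best_tactic, best_score = tactic, score
    return best_tactic, best_score                   # Step 6: select the optimum


# Toy usage; every number below is an illustrative assumption.
rng = random.Random(0)
TARGET = 24.0   # desired indoor temperature in °C
INDOOR = 26.0   # current indoor temperature in °C


def sample_outdoor():
    # Forecast from the example: uniform between 24 °C and 30 °C.
    return rng.uniform(24.0, 30.0)


def simulate_indoor(tactic, outdoor):
    # Hypothetical model: the control shifts the indoor temperature and the
    # outdoor air pulls it by 10% of the temperature difference.
    return INDOOR + tactic + 0.1 * (outdoor - INDOOR)


def distance(indoor):
    return abs(indoor - TARGET)


best, score = plan_adaptation([-3.0, -2.0, -1.0, 0.0], sample_outdoor,
                              simulate_indoor, distance)
print(best)  # -2.0
```

In this toy setting, lowering the temperature by 2 °C keeps the simulated indoor temperature within 0.4 °C of the target for every sampled outdoor temperature, so it is selected regardless of the random seed. The loop also makes the cost structure visible: the planning time grows with the number of tactics, the number of samples per tactic, and the time of a single simulation.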

#### **5.6 Executing Adaptation**

*Principle.* (Step 7) The chosen optimal adaptation tactic is applied to the managed system by the actuators of the system.

*Example.* The adaptive air control system operates the selected optimal temperature and humidity control. The controls affect the indoor conditions through the system's actuators.

#### **5.7 PASTA Implementation**

We also provide a PASTA reference architecture in Fig. 4 for the implementation of this approach. It is a layered architecture of an SAS with the PASTA approach. In the interaction layer, PASTA monitors the environment and managed system through the sensor and affects them through the actuators, like typical SASs. In the data analysis layer, there is a forecasting engine for the prediction of environmental changes and a knowledge management module for keeping the knowledge of the system up-to-date at all times. In the adaptation planner layer, a module searches for the optimal adaptation tactic through interactions with the adaptation verification layer. In the adaptation verification layer, the SMC module verifies an adaptation tactic governing the sample generator, the simulator, and the verifier.

The sample generator produces samples of the future environment based on the prediction of the forecasting engine. The simulator simulates the system model with an adaptation tactic in the given sample future environment. The verifier analyzes the simulation results to check the adaptation goal achievement, such as quality of service or invariant properties. In the knowledge layer, there is an environment database, a system model manager, an adaptation tactic repository, and an adaptation goal manager. This layer interacts with the others, providing and updating the knowledge of the SAS. This architecture is a reference, so it includes the essential components of an SAS with the PASTA approach and can be extended.

**Fig. 4.** PASTA reference architecture

In addition, to support engineers who develop SASs based on the PASTA approach explained in the previous sections, we implemented a PASTA skeleton based on the reference architecture, annotated it with guiding comments, and released the source code in an open-source repository<sup>1</sup>. The skeleton is available in Java and Python. Engineers should write application-specific code following the comments tagged with "*todo*". The class diagram of the skeleton is presented in Fig. 5. An adaptation is activated by the "*adaptManagedSystem*" operation. The skeleton eases PASTA implementation and allows third-party libraries or tools to be used for some components, such as the forecasting engine or the SMC module.

#### **6 Evaluation**

#### **6.1 Research Questions**

We demonstrate the feasibility of applying the PASTA approach to SAS development as an efficient alternative to PMC-based proactive adaptation. We address three research questions.

*RQ1: (Cost efficiency of PASTA) How fast is PASTA's adaptation planning?* PASTA leverages SMC for efficient adaptation verification at runtime. Almost all existing proactive adaptation approaches utilize PMC for the runtime verification of adaptation tactics, and we position PASTA as an efficient alternative to them. To determine the efficiency of PASTA, we compare the adaptation planning time of PASTA and the PMC-based adaptation. We measure the differences in time consumption between the SMC- and PMC-based approaches in solving proactive adaptation problems of the same complexity.

*RQ2: (Adaptation planning accuracy of PASTA) How accurately does PASTA search for the optimal adaptation tactic?* PMC formally examines a probabilistic model and verifies whether it satisfies the given properties; however, SMC examines the given model with numerous sample simulation results, so it returns statistical evidence of the model's properties and thus has the inevitable limitation that its verification results can be inaccurate because they rest on a finite number of samples. It is known that SMC can produce results similar to PMC [19,23,34], and for this research question, we compare the proactive adaptation planning results of PASTA with those of the PMC-based approach. We determine how much accuracy has been lost in exchange for the cost savings identified in RQ1 as well as whether the loss of accuracy is acceptable.

<sup>1</sup> https://github.com/yongjunshin/PASTA

**Fig. 5.** Class diagram of the PASTA skeleton

*RQ3: (Adaptation performance of PASTA) How effective is the adaptation goal achievement performance of PASTA?* For research question 3, we examine whether the PASTA approach is actually effective in achieving the adaptation goals of SASs. To evaluate the adaptation performance of PASTA, we compare the simulation execution results of approaches taking no adaptation, reactive adaptation, PMC-based proactive adaptation, and PASTA.

#### **6.2 Evaluation Setup**

We evaluate the PASTA approach using two example SASs. One is the adaptive air condition control system, the illustrative example of this paper, and the other is an adaptive traffic signal controller of an intersection. The flow of cars in cities changes with the passage of time, which causes traffic congestion. A smart traffic signal controller that automatically controls traffic flow is a good example of applying proactive adaptation because changes in traffic conditions can be predicted based on historical data. Our signal controller predicts the traffic volume in an intersection and identifies an optimal configuration of signal patterns that minimizes the number of waiting vehicles. An actual signal controller is abstracted, and durations of signal patterns are dynamically controlled, as shown in Fig. 6. We applied PASTA to the two cases of different complexities and simulated them based on actual data acquired from public data repositories to make them realistic. Detailed descriptions of the two SASs and the evaluation setup are provided in Table 1.

**Fig. 6.** Adaptation tactic of traffic signal controller

We compared the adaptation cost, accuracy, and performance of the PASTA approach with the PMC-based proactive adaptation approach. The PMC-based proactive adaptation approach was implemented following a pioneering paper [26]. PRISM, a widely used probabilistic model checker, was utilized in the implementation [21]. We used the default hybrid computation engine. The models of environments, systems, and tactics were specified as Markov decision processes (MDPs), and the adaptation goals were specified as reward-based properties of the MDPs. As in [26], the upcoming environmental changes were predicted based on the data, the PRISM modules were constructed and verified based on the prediction, and the optimal adaptation tactic was thereby found. In addition to the PMC-based approach, non-adaptation and reactive adaptation approaches were also compared in terms of a system's goal achievement. For the PASTA approach, SMCS, the simplest SMC algorithm as explained in the background section, was implemented and evaluated by varying the number of samples used for the verification from 10 to 10000 (10, 100, 1000, 2000, ..., 9000, 10000).

#### **6.3 Evaluation Results**

**RQ1:** We measured and compared the time spent on adaptation planning for both case systems using the PASTA and PMC-based approaches. The adaptation planning time includes the modeling or sampling time and the probabilistic or statistical verification time needed to identify the optimal tactic. Figs. 7 and 8 show the evaluation results for each system. The reported planning time is the average of 100 repeated experiments. The adaptation planning time for the PMC-based approach is constant, but the time for PASTA increases in proportion to the number of samples used for the SMC because the time for a single simulation is almost constant. Unfortunately, for the traffic signal controller, PMC could not produce adaptation planning results with 2 GB of memory because its models and tactics are more complex than those of the air condition control system and therefore consume more verification resources. Therefore, for the traffic signal controller, the adaptation planning time for the PMC-based approach is not reported; however, the results for both systems indicate that PASTA completes adaptation planning much faster than the PMC-based approach. It was also confirmed that the adaptation planning time of PASTA is proportional to the number of samples and the complexity of the adaptation problem.

**Table 1.** SASs for evaluation

**Fig. 7.** Adaptation planning cost - Air condition control system

**Fig. 8.** Adaptation planning cost - Traffic signal controller

**RQ2:** To confirm the similarity of the optimal tactics that the PASTA and PMC-based approaches found, we compared the optimal tactics returned by the two approaches in the same situation. To quantify the similarity, we defined two criteria. If the two tactics were the same, they were defined as *identical*, and if they were adjacent in terms of the tactic specifications, they were defined as *similar*. For example, for the air condition control system, the temperature control tactics +3°C and +3.1°C are adjacent because the temperature control unit is 0.1°C based on the system's capability, and the probability that two arbitrary tactics are adjacent is less than 2%. Because the samples used by SMC are randomly generated, we repeated the PASTA experiments 100 times and report the percentage of tactics that are identical or similar to the tactic returned by the PMC-based approach. Because the PMC-based approach could not find the optimal tactic for the traffic signal controller, only the experimental results of the air condition controller are shown in Fig. 9. PASTA always found the same or a similar optimal tactic as the PMC-based approach except when using 10 samples; however, one limitation of utilizing SMC is that, regardless of how far we increased the number of samples, we could not always obtain the same result as the PMC-based approach, which is considered the oracle. For this case system, PASTA returned exactly the oracle's result in approximately 50% of the runs on average.
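The two similarity criteria can be made concrete with a short sketch. The function names and the `"different"` label are hypothetical; the 0.1 control unit is taken from the air condition example above, and tactics are assumed to be single temperature offsets.

```python
def classify_agreement(smc_tactic, pmc_tactic, unit=0.1):
    """Compare an SMC-selected tactic with the PMC (oracle) tactic.

    Tactics are temperature offsets in degrees Celsius; two tactics are
    adjacent when they differ by exactly one control unit.
    """
    steps = round(abs(smc_tactic - pmc_tactic) / unit)  # distance in control units
    if steps == 0:
        return "identical"
    if steps == 1:
        return "similar"
    return "different"


def agreement_rates(smc_tactics, pmc_tactic, unit=0.1):
    """Percentage of repeated SMC runs in each agreement class."""
    counts = {"identical": 0, "similar": 0, "different": 0}
    for tactic in smc_tactics:
        counts[classify_agreement(tactic, pmc_tactic, unit)] += 1
    total = len(smc_tactics)
    return {label: 100.0 * count / total for label, count in counts.items()}
```

Here `smc_tactics` would hold the tactic chosen in each of the repeated PASTA runs, and the `"identical"` and `"similar"` percentages correspond to the two quantities reported per sample size.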

**Fig. 9.** Adaptation planning accuracy - Air condition control system

**RQ3:** For RQ1 and RQ2, we showed that PASTA can quickly find a suboptimal adaptation tactic that is similar to the PMC-based approach's result. For RQ3, we obtained simulation results to confirm the adaptation performance of the PASTA approach in comparison with non-adaptation, reactive adaptation, and PMC-based proactive adaptation. As shown in Fig. 10, the goal of the air condition control system was to keep the temperature at 25°C, and the proactive adaptation approaches showed a better adaptation performance than the other strategies. In addition, the PASTA and PMC-based approaches exhibited a similar performance because PASTA always made adaptation decisions similar to those of the PMC-based approach. In Fig. 11, the goal of the traffic signal controller was to reduce the number of vehicles waiting at the intersection as much as possible, and proactive adaptation using PASTA showed the best performance. These two results demonstrate that proactive adaptation outperforms reactive adaptation and that PASTA shows adaptation performance similar to the PMC-based approach at a smaller verification cost.

**Fig. 10.** Adaptation performance - Air condition control system

**Fig. 11.** Adaptation performance - Traffic signal controller


**Table 2.** Comparison of proactive adaptation approaches

We compared two approaches to proactive adaptation: the PMC-based and SMC-based (PASTA) approaches. As we confirmed in our evaluation, the two approaches have their own advantages and disadvantages, so engineers should carefully decide which to choose for their SAS development. We summarized our insights regarding their characteristics in Table 2 to guide engineers' decision making. As we emphasized, the SMC-based approach makes adaptation decisions faster than the PMC-based approach because it verifies a system's adaptation tactics faster. In addition, as long as simulation results can be generated from the given models, the modeling language is not limited to that of a model checker; however, an adaptation decision made by the SMC-based approach may not be globally optimal. Therefore, the SMC-based approach may not be suitable for some safety-critical systems, and the PMC-based approach could be the better choice if the trustworthiness of the system is the most important concern. For SASs requiring a lower adaptation cost, such as real-time systems, PASTA is more appropriate than the PMC-based approach.

# **7 Threats to Validity**

One threat is the selection of the SMC algorithm. We selected SMCS to demonstrate the adaptation performance when selecting the simplest SMC algorithm. SMCS is suitable for explicitly indicating SMC-based adaptation costs affected by the number of samples, and all other SMC algorithms have similar characteristics. To reduce this threat, we also implemented SSP and SPRT and compared them to the PMC-based approach, and both showed similar cost, accuracy, and performance differences. Therefore, for this paper, only SMCS was selected and explained by varying the number of samples.

Another threat is the implementation of the PMC-based adaptation approach. We implemented the PMC-based approach directly following paper [26]. This threat was reduced because the authors published all the structures and code of the PRISM module for the implementation of the approach. We implemented the two case systems according to the PRISM module code shown in the paper. For a fair comparison, environment, system, and adaptation tactic spaces of the same complexities were given to both the PMC-based and PASTA approaches.

#### **8 Conclusion**

We have proposed PASTA, a proactive adaptation approach using SMC, as an efficient alternative to PMC-based proactive adaptation. We applied the PASTA approach to two realistic SASs. Through experiments based on actual data, we confirmed that PASTA makes an adaptation decision similar to that of the PMC-based proactive adaptation approach in a shorter time. We then confirmed that the adaptation decision is more effective in achieving the system's goals than non-adaptation, reactive adaptation, and the PMC-based approach. Currently, PMC-based approaches are considered the major trend in proactive adaptation, but in this paper, we showed that the SMC-based proactive adaptation approach can be an efficient alternative. In addition, the algorithmic processes, reference architecture, and open-source skeleton of PASTA proposed in this paper will be of substantial help to developers who wish to apply PASTA to SAS development. This study was primarily conducted to validate the PASTA approach, but in the future, we plan to study methods such as effective sampling and adaptation space reduction for a more effective PASTA approach, and we also plan to apply PASTA to actual running systems.

#### **Acknowledgement**

This research is partly supported by the MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2020-2020-0-01795) supervised by the IITP (Institute of Information & Communications Technology Planning & Evaluation). This research is partly supported by an IITP grant funded by MSIT (No. 2015-0-00250, (SW Star-Lab) Software R&D for Model-based Analysis and Verification of Higher-order Large Complex System). This research is partly supported by the Next-Generation Information Computing Development Program through the National Research Foundation of Korea (NRF) funded by MSIT (2017M3C4A7066212).

# **References**



# Understanding Local Robustness of Deep Neural Networks under Natural Variations

Ziyuan Zhong, Yuchi Tian, and Baishakhi Ray

Columbia University, New York, NY, USA {ziyuan.zhong, yuchi.tian}@columbia.edu, rayb@cs.columbia.edu

Abstract. Deep Neural Networks (DNNs) are being deployed in a wide range of settings today, from safety-critical applications like autonomous driving to commercial applications involving image classifications. However, recent research has shown that DNNs can be brittle to even slight variations of the input data. Therefore, rigorous testing of DNNs has gained widespread attention.

While DNN robustness under norm-bound perturbation has received significant attention over the past few years, our knowledge is still limited when it comes to natural variants of the input images. These natural variants, e.g., a rotated or a rainy version of the original input, are especially concerning as they can occur naturally in the field without any active adversary and may lead to undesirable consequences. Thus, it is important to identify the inputs whose small variations may lead to erroneous DNN behaviors. The very few studies that looked at DNN robustness under natural variants, however, focus on estimating the overall robustness of DNNs across all the test data rather than localizing such error-producing points. This work aims to bridge this gap.

To this end, we study the local per-input robustness properties of the DNNs and leverage those properties to build a white-box (DeepRobust-W) and a black-box (DeepRobust-B) tool to automatically identify the non-robust points. Our evaluation of these methods on three DNN models spanning three widely used image classification datasets shows that they are effective in flagging points of poor robustness. In particular, DeepRobust-W and DeepRobust-B are able to achieve an F1 score of up to 91.4% and 99.1%, respectively. We further show that DeepRobust-W can be applied to a regression problem in a domain beyond image classification. Our evaluation on three self-driving car models demonstrates that DeepRobust-W is effective in identifying points of poor robustness with an F1 score of up to 78.9%.

Keywords: Deep Neural Networks · Software Testing · Robustness of DNNs.

#### 1 Introduction

Deep Neural Networks (DNNs) have achieved an unprecedented level of performance over the last decade in many sophisticated areas such as image recognition [38], self-driving cars [5] and playing complex games [65]. These advances have also motivated companies to adapt their software development flows to incorporate AI components [3]. This trend has, in turn, spawned a new area of research within software engineering addressing the quality assurance of DNN components [11, 20, 32, 36, 40, 42, 55, 57, 73, 74, 91, 92].

Fig. 1: (a)-(d) A well-trained ResNet model [14] misclassifies the rotated variations of a bird image into three different classes though the original un-rotated image is classified correctly. (e)-(h) The same model successfully classifies all the rotated variants of another bird image from the same test set. The sub-captions consist of rotation degrees and the predicted classes. Predicted classes for panels (a)-(h): bird, airplane, cat, dog, bird, bird, bird, bird.

Notwithstanding the impressive capabilities of DNNs, recent research has shown that DNNs can be easily fooled, i.e., made to mispredict, with a little variation of the input data [14, 23, 73], either by adding a norm-bound pixel-level perturbation to the original input [9, 23, 71] or with *natural* variants of the inputs, e.g., rotating an image, changing the lighting conditions, adding fog, etc. [14, 52, 55]. The natural variants are especially concerning as they can occur naturally in the field without any active adversary and may lead to serious consequences [73, 92].

While norm-bound perturbation based DNN robustness is relatively well-studied, our knowledge of DNN robustness under natural variations is still limited: we do not know which images are more robust than others, what their characteristics are, etc. For example, consider Figure 1: although the original bird image (a) is predicted correctly by a DNN, its rotated variations in images (b)-(d) are mispredicted to three different classes. This makes the original image (a) very weak as far as robustness is concerned. In contrast, the bird image (e) and all its rotated versions (generated by the same degrees of rotation) in Figure 1:(f)-(h) are correctly classified. Thus, the original image (e) is quite robust. It is important to distinguish between such robust vs. non-robust images, as the non-robust ones can induce errors with slight natural variations.

Existing literature, however, focuses on estimating the overall robustness of DNNs across all the test data [4, 14, 88]. From a traditional software point of view, this is analogous to estimating how buggy a piece of software is without actually localizing the bugs. Our current work tries to bridge this gap by localizing the non-robust points in the input space that pose significant threats to a DNN model's robustness. However, unlike traditional software where bug localization is performed in program space, we identify the non-robust inputs in the data space. As a DNN is a combination of data and architecture, and the architecture is largely uninterpretable, we restrict our study of non-robustness to the input space. To this end, we first quantify the local (per-input) robustness property of a DNN. We treat all the natural variants of an input image as its *neighbors*. Then, for each input data point, we consider a population of its neighbors and measure the fraction of this population classified correctly by the DNN; a high fraction of correct classifications indicates good robustness (Figure 1:e) and vice versa (Figure 1:a). We term this measure *neighbor accuracy*. Using this metric, we study different local robustness properties of the DNNs and analyze how the weak, *a.k.a.* non-robust, points differ characteristically from their robust counterparts. Given that the number of natural neighbors of an image can be potentially infinite, we first perform a more controlled analysis by keeping the natural variants limited to spatially transformed images generated by rotation and translation, following previous work [4, 14, 88]. Such controlled experiments help us to explore different robustness properties while systematically varying transformation parameters.

Our analysis with three well-known object recognition datasets across three popular DNN models, i.e., a total of nine DNN-dataset combinations, reveals several interesting properties of the local robustness of a DNN *w.r.t.* natural variants:


Based on these findings, we further develop two techniques (a black-box and a white-box one) that can localize the points of poor robustness, thereby providing a means of input-specific, real-time feedback about robustness to the end user. Our white-box and black-box detectors can identify weak, *a.k.a.* non-robust, points with an F1 score of up to 91.4% and 99.1%, respectively, at a neighbor accuracy cutoff of 0.75. To further check the generalizability of our technique, we detect weak points *w.r.t.* a self-driving car application, where we generate natural input variants by adding rain and fog. Note that these are more complex image transformations, and the models work in a regression setting instead of classification: they take an image as input and output a driving angle. Our white-box detector can identify weak points with an F1 score of up to 78.9%.

In summary, we make the following contributions:


### 2 Background: DNN Testing

Existing studies have proposed different techniques to generate test data inputs by perturbing input images for a DNN and use them to evaluate the robustness of the DNN. Depending on how the input image is perturbed, the techniques for generating DNN test data can be classified into three broad categories:

*i) Adversarial inputs* are typically generated by norm-based perturbation techniques [9, 23, 39, 46, 53, 85] where some pixels of an input image ($I$) are perturbed under a norm-based distance ($\ell_1$, $\ell_2$, or $\ell_\infty$) such that the distance between the perturbed image and $I$ is $\leq \epsilon$, where $\epsilon$ is a small positive value. These adversarial examples are used to expose the security vulnerabilities of DNNs.
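As a small illustration of the norm-bound constraint, the sketch below computes the three distances between an image and its perturbed version and checks them against a budget. Images are represented as flat lists of pixel values to keep the example dependency-free; the function names and this representation are assumptions of the sketch.

```python
def lp_distance(original, perturbed, norm):
    """Distance between two equally sized images given as flat pixel lists."""
    diffs = [abs(a - b) for a, b in zip(original, perturbed)]
    if norm == "l1":
        return sum(diffs)  # total absolute change
    if norm == "l2":
        return sum(d * d for d in diffs) ** 0.5  # Euclidean distance
    if norm == "linf":
        return max(diffs)  # largest change to any single pixel
    raise ValueError(f"unknown norm: {norm}")


def within_budget(original, perturbed, epsilon, norm="linf"):
    """True if the perturbation stays inside the epsilon ball around the original."""
    return lp_distance(original, perturbed, norm) <= epsilon
```

For pixels `[10, 20, 30]` perturbed to `[13, 16, 30]`, the per-pixel changes are 3, 4, and 0, so the same perturbation has a different size under each norm, which is why the norm must be stated together with the budget.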

*ii) Natural variations* are generated through a variety of image transformations and are used to evaluate the robustness of DNNs under such variations [13, 14, 73]. Sources of these variations include changes in camera configuration and variations in background or ambient conditions. The transformations simulating these variations can be spatial, such as rotation, translation, mirroring, shear, and scaling of images, or non-spatial, such as changes in the brightness or contrast of an image. Here we first focus on spatial transformations as opposed to adversarial ones for two reasons. First, compared with adversarial examples, which are fairly contrived, spatial transformations are more likely to arise in benign environments. Second, using simple parametric spatial transformations like rotations and translations, it is easier to systematically explore the local robustness properties. Later, to emulate more natural variations, we add fog and rain to the images of a self-driving car dataset and evaluate our method's generalizability.

*iii) GAN-based* image generation techniques use a Generative Adversarial Network (GAN) to synthesize images. GANs are a class of generative models trained as a minimax two-player game between a generative model and a discriminative model [22]. GAN-based image generation has been successfully used to generate DNN test data instances [92, 93].

Standard Accuracy vs. Robust Accuracy. Standard accuracy measures how accurately an ML model predicts the correct classes of the instances in a given test dataset. Robust accuracy, *a.k.a.* adversarial accuracy, estimates how accurately an ML model classifies the generated variants [76]. In this paper, we adopt a pointwise robust accuracy measure, *neighbor accuracy*, to quantify the robustness of a DNN for the neighbors around each data point.

#### 3 Methodology

#### 3.1 Terminology

Original Data Point: An original data point represents an original un-modified data instance (image in our case) in the studied dataset. The original data points can come from the training, validation, or testing dataset, depending on the experimental setting. In Figure 2, the triangle in the center is an original data point.

Neighbors: Neighbors are images generated by natural variations, e.g., spatial transformations applied to an original image. Since the transformation parameters are continuous (e.g., degree of rotation), there can be an infinite number of neighbors per image. In Figure 2, the small circles around an original data point represent its neighbors.

Neighbor Accuracy: We define the *neighbor accuracy* of an original data point as the percentage of its neighbors, including itself, that are correctly classified by the DNN model. Figure 2 illustrates this; here, the small red circles indicate misclassified neighbors, while the small green circles are correctly classified ones. In the figure, each original data point has only five neighbors. In the left-hand-side diagram, four out of five neighbors are correctly classified by the given DNN model. If the original data point is correctly classified as well, the neighbor accuracy of the original data point is (5/6=) 83.3%. Similarly, in Figure 2 (right), four out of the five neighbors have been misclassified by the model; if the original data point is misclassified as well, the neighbor accuracy is (1/6=) 16.7%.

Robustness. An original data point is strong, *a.k.a.* robust, w.r.t. the DNN model under test if its neighbor accuracy is higher than a predefined threshold. Conversely, a weak, *a.k.a.* non-robust, point has a neighbor accuracy lower than the predefined threshold. For example, at a neighbor accuracy threshold of 0.75, the black triangle in Figure 2 is a strong point, and the grey triangle is a weak point.
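The two definitions above reduce to a few lines of arithmetic. The sketch below reproduces the Figure 2 numbers; the function names are hypothetical, and treating a point whose accuracy equals the threshold as weak is an assumption, since the text only covers the strictly higher and strictly lower cases.

```python
def neighbor_accuracy(original_correct, neighbors_correct):
    """Fraction of correct predictions over the original point and its neighbors.

    original_correct: whether the original point is classified correctly.
    neighbors_correct: one boolean per sampled neighbor.
    """
    outcomes = [original_correct] + list(neighbors_correct)
    return sum(outcomes) / len(outcomes)  # True counts as 1, False as 0


def robustness_label(accuracy, threshold=0.75):
    """Label a point as strong or weak (ties are treated as weak, an assumption)."""
    return "strong" if accuracy > threshold else "weak"
```

With the left-hand diagram of Figure 2 (original correct, four of five neighbors correct) the accuracy is 5/6, which is above 0.75 and therefore strong; with the right-hand diagram (original wrong, one of five neighbors correct) it is 1/6 and therefore weak.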

A region contains an original point and all of its neighbors. If the original point is strong (weak), we call the corresponding region a robust (weak) region. In Figure 2, the light green region is robust while the light red region is weak.

Fig. 2: Illustrating our terminologies. The triangles are original points, and the small circles are their neighbors generated by natural variations. The light-green region is robust with higher neighbor accuracy, while the light-red region is vulnerable. The corresponding original points are robust and non-robust accordingly.

Neighbor Diversity: For a multi-class classification task, different neighbors of an original point can be misclassified into different classes. The neighbor diversity score measures how many different classes a point's neighbors are classified into, and is formally computed using the Simpson Diversity Index ($\lambda$) [67]:

$$\lambda = \sum_{i=1}^{k} p_i^2 \qquad (1)$$

where $k$ is the total number of possible classes and $p_i$ is the probability of an image's neighbors being predicted to be class $i$. A large Simpson index means low diversity. Consider three possible classes A, B, and C, and assume an image has 4 neighbors. Including the original image, there are 5 images in total. If two of the five images are classified as A, and the rest are classified as B, then $\lambda = (2/5)^2 + (3/5)^2 + (0/5)^2 = 0.52$. In contrast, if two of them are classified as A, two are classified as B, and one is classified as C, then $\lambda = (2/5)^2 + (2/5)^2 + (1/5)^2 = 0.36$. Clearly, the latter case is more diverse and thus has a lower $\lambda$ score.
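The worked example above can be checked with a short function. This is a minimal sketch that estimates each class probability as the observed fraction of predictions; the function name is hypothetical.

```python
from collections import Counter


def simpson_index(predicted_classes):
    """Simpson Diversity Index: sum of squared class fractions.

    predicted_classes: predicted labels for a point and its neighbors.
    Classes that never occur contribute zero, so they can be ignored.
    """
    total = len(predicted_classes)
    if total == 0:
        raise ValueError("at least one prediction is required")
    counts = Counter(predicted_classes)
    return sum((count / total) ** 2 for count in counts.values())
```

Passing `["A", "A", "B", "B", "B"]` reproduces the first case and `["A", "A", "B", "B", "C"]` the second; a point whose neighbors all receive the same label reaches the maximum value of 1.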

Feature Representation: In a DNN, the neurons' outputs in each layer capture a different abstract representation of the raw input, commonly known as features, extracted by the current layer and all the preceding layers. Each layer's output forms the corresponding feature space. For a given input data point, we consider the output of the DNN's second-to-last layer as its feature representation or feature vector.

#### 3.2 Data Collection

Neighbor Generation: For the image classification tasks, for each original image point, we generate its neighbors by combining two types of spatial transformations: rotation and translation. We carefully choose these two types as representatives of non-linear and linear spatial transformations, respectively, following Engstrom et al. [14]. In particular, following them, we generate a neighbor by randomly rotating the original point by t (∈ [−30, 30]) degrees, shifting it by dx (about 10% of the original image's width, i.e., ∈ [−3, 3]) pixels horizontally, and shifting it by dy (about 10% of the original image's height, i.e., ∈ [−3, 3]) pixels vertically. It should be noted that for image classification it is standard in the literature [14, 15, 86] to assume that the transformed image has the same label as the original one. As the transformation parameters are continuous, there can be infinitely many neighbors of an original data point. Hence, we sample m neighbors for each original data point. We explore the impact of m in RQ2.
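The parameter sampling for the image classification setting can be sketched as follows. The function name and dictionary keys are hypothetical, and drawing each parameter uniformly (including fractional pixel shifts) is an assumption of this sketch; applying the sampled rotation and translation to an actual image is left to an image library and is not shown.

```python
import random


def sample_neighbor_params(m, max_rotation=30.0, max_shift=3.0, seed=None):
    """Draw m random (rotation, dx, dy) triples, one per neighbor.

    Ranges follow the text: rotation in [-30, 30] degrees and shifts in
    [-3, 3] pixels along each axis.
    """
    rng = random.Random(seed)
    return [
        {
            "rotation_deg": rng.uniform(-max_rotation, max_rotation),
            "dx": rng.uniform(-max_shift, max_shift),
            "dy": rng.uniform(-max_shift, max_shift),
        }
        for _ in range(m)
    ]
```

Fixing `seed` makes the sampled neighborhood reproducible, which helps when the same neighbors must be used to compare several models.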

For the self-driving-car task where the model predicts the steering angle, for each original image point, we generate 50% of the neighbors with a rain effect and the remaining 50% with a fog effect. We adopt a widely used self-driving car data augmentation package, Automold [60], for adding these effects, where we randomly vary the degree of the added effect. For the rain effect, we set "rain\_type=heavy" and leave everything else as default. For the fog effect, we leave everything as default.

Estimating Neighbor Accuracy: To compute the neighbor accuracy of a data point for a given DNN model, we first generate its neighbor samples by applying different transformations: spatial for image classification and rain or fog for the self-driving-car application. Then we feed these generated neighbors into the DNN model and compute the accuracy by comparing the DNN's output with the label of the original data point. For the self-driving-car application, we follow the technique described in DeepTest [73]. More specifically, if the predicted steering angle of the transformed image is within a threshold of that of the original image, we consider it correct. This ensures that any small variations of the steering angle are tolerated in the predicted results. We then compute

$$\text{neighbor accuracy} = \frac{\#\,\text{correct predictions}}{\text{original point} + \#\,\text{total neighbors}}.$$
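The estimation procedure can be sketched for both settings with the model treated as a plain callable. The function names are hypothetical. For the regression case, the sketch assumes that the reference angle is the model's own prediction on the original image, which is one reading of the description above.

```python
def classification_neighbor_accuracy(model, original, neighbors, label):
    """Share of the original point and its neighbors predicted as `label`."""
    inputs = [original] + list(neighbors)
    correct = sum(1 for x in inputs if model(x) == label)
    return correct / len(inputs)


def regression_neighbor_accuracy(model, original, neighbors, tolerance):
    """Share of inputs whose predicted angle stays within `tolerance` of the
    prediction on the original image (reference choice is an assumption)."""
    reference = model(original)
    inputs = [original] + list(neighbors)
    correct = sum(1 for x in inputs if abs(model(x) - reference) <= tolerance)
    return correct / len(inputs)
```

In both functions the denominator is the number of neighbors plus one for the original point, matching the formula above.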

#### 3.3 Classifying Robust vs. Weak Points

We propose two methods, DeepRobust-W and DeepRobust-B, to identify whether an unlabeled input is strong or weak *w.r.t.* a DNN in real time. If a test image is identified as a weak point, although it may be classified correctly by the pre-trained model, this image is in a vulnerable region where a slight change to this image may cause the pre-trained DNN to misclassify the changed input.

DeepRobust-W: White-box Classifier. This is a binary classifier designed to classify an image (in particular, its feature vector) as a strong or weak point. Here, we assume that we have white-box access to the DNN under test to extract the feature vectors of the input images from the DNN. These feature vectors are given as inputs to DeepRobust-W. Figure 3 shows the workflow.

*Training*: During training of DeepRobust-W, we first feed all the original training images and their neighbors to the DNN under test. From the DNN outputs, we compute the neighbor accuracy for each data point in the training set and label each point strong/weak depending on whether its neighbor accuracy is higher/lower than a predefined threshold. For each original data point, we also extract the output of the DNN's second-to-last layer as its feature vector. We use these vectors as inputs to train DeepRobust-W and the outputs are the corresponding strong/weak labels.

*Testing*: Given a test input, we extract its feature vector by feeding the test image to the DNN under test and then feed the extracted feature vector to the trained DeepRobust-W, which predicts if the input is a strong or weak point.

DeepRobust-B: Black-box Classifier This is also a binary classifier that is intended to classify an image as a strong or weak point. However, here the user does not have white-box access to the DNN under test. Figure 4 shows the workflow.

Given a test input, we first randomly generate some of its neighbors. We then query the DNN under test with all these neighbors and compute the diversity score, as per Equation 1. If the neighbor diversity score (inversely correlated with neighbor diversity) is greater than a given diversity score threshold, the given test input is classified as a strong point; otherwise, a weak point.

Notice that, in this method, we do not need a training step. We only need the diversity score threshold, which can be empirically set using a ground-truth data set. In particular, we first calculate the neighbor accuracy and diversity score of each pre-annotated point. Next, based on a given neighbor accuracy threshold, we identify the weak points, as the ground truth. The highest diversity score among these weak points is chosen as the diversity score threshold.
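The threshold selection and the resulting decision rule can be sketched as below. This is an illustrative sketch under stated assumptions: the function names and the toy numbers are hypothetical, diversity scores are taken as already computed, and a point whose neighbor accuracy equals the cutoff is treated as strong.

```python
from typing import Sequence


def calibrate_threshold(
    neighbor_acc: Sequence[float],
    diversity: Sequence[float],
    acc_cutoff: float,
) -> float:
    """Return the highest diversity score among the ground-truth weak points."""
    # Ground-truth weak points: neighbor accuracy below the cutoff (assumed strict).
    weak_scores = [d for a, d in zip(neighbor_acc, diversity) if a < acc_cutoff]
    if not weak_scores:
        raise ValueError("no weak points in the calibration set")
    return max(weak_scores)


def classify(diversity_score: float, threshold: float) -> str:
    """Strong if the score is greater than the threshold, weak otherwise."""
    return "strong" if diversity_score > threshold else "weak"


# Hypothetical pre-annotated points: (neighbor accuracy, diversity score).
accs = [1.0, 0.9, 0.6, 0.4, 0.8]
scores = [1.0, 0.82, 0.45, 0.30, 0.70]

threshold = calibrate_threshold(accs, scores, acc_cutoff=0.75)
labels = [classify(s, threshold) for s in scores]
print(threshold, labels)
```

Choosing the maximum score among the weak points means that every weak point in the calibration set is flagged, at the possible cost of also flagging some strong points with low scores.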

Usage Scenario DeepRobust-W/B works in a real-world setting where a customer/user runs a pre-trained DNN model that constantly receives inputs in real time, and wants to test whether the prediction of the DNN on a given input can be trusted. DeepRobust-W assumes that the user has white-box access to the DNN under test and to all the training data used to train the DNN. DeepRobust-W leverages the feature vectors and neighbor accuracy of the training data to train the classifier, which can notify the user whether the current input is a strong point or a weak point. If the input is classified as a strong point, the user can give more trust to the original DNN's prediction. On the other hand, if the point is classified as a weak point, the user may want to be more cautious about the DNN's prediction and conduct additional inspections.

In the black-box setting, DeepRobust-B assumes the user does not have white-box access to the DNN under test. DeepRobust-B comes with a small overhead of transforming the input multiple times to get some neighbors and querying the DNN under test on them to estimate the diversity score.

# 4 Experimental Design

### 4.1 Study Subjects

Image Classification Similar to many existing works [36, 41, 61, 73, 74, 92] on DNN testing, in this work, we use image classification application of DNNs as the basis of our investigation. This is one of the most popular computer vision tasks, where the model tries to classify the objects in an image or video.

Datasets: We conduct our experiments on three image classification datasets: F-MNIST [87], CIFAR-10 [37], and SVHN [89].


Architectures: The popular DNN-based image classifiers are variants of convolutional neural networks (CNN) [28,38,79]. Here we study the following three architectures for all the three datasets:


– WRN: We use a structure with block type (3, 3) and depth 28 in [90] but replace the widening factor 10 with 2 for fewer parameters and faster training.

We train all the models from scratch using widely used hyper-parameters and achieve an acceptable level of validation accuracy (natural accuracy). When training models on CIFAR-10, we pre-process the input images with random augmentation (random translation with dx, dy ∈ [−2, 2] pixels both horizontally and vertically), which is a widely used preprocessing step for this dataset. When training models on the other two datasets, plain images are directly fed into the models. The natural accuracies and robust accuracies of the models are shown in Table 1.

Steering Angle Prediction We further evaluate DeepRobust-W in a self-driving car application to show that it can be applied to a regression task. These models learn to steer (i.e., predict the steering angle) by taking in visual inputs from car-mounted cameras that record the driving scene, paired with the steering angles from a human driver.

Table 1: Study Subjects (values are in percentage)

'Natural accuracy. \*Robust accuracy is estimated as the average neighbor accuracy for test data points.

Datasets: We use the dataset by Stocco *et al.* [68], which is collected by the authors driving on three tracks of different environments in the Udacity Simulator [77]. It consists of 37888 central camera training images and 9427 central camera evaluation images. Each image is of size 320x120.

Architectures: We evaluate our method on the three pre-trained DNN models used in [68]: NVIDIA DAVE-2 [6], Epoch [2], and Chauffeur [1]. These models have been used by many previous testing works on self-driving cars [55, 68, 73].

#### 4.2 Evaluation

Evaluation Metric. We evaluate both DeepRobust-W and DeepRobust-B for detecting weak points under twelve and nine different DNN-dataset combinations, respectively, in terms of precision, recall, and F1 score. Let E be the set of weak points detected by our tool and A be the set of true weak points in the ground truth set. Then the precision and recall are $\frac{|A \cap E|}{|E|}$ and $\frac{|A \cap E|}{|A|}$, respectively. The F1 score is a single accuracy measure that considers both precision and recall, and is defined as $\frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$. We perform each experiment for two thresholds of neighbor accuracy that define strong vs. weak points: 0.75 and 0.50.
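As a small worked example of these three metrics, the sketch below computes them from two sets of point identifiers. The function name and the toy sets are hypothetical, and returning 0 when a denominator is empty is a convention chosen here for illustration.

```python
from typing import Set, Tuple


def precision_recall_f1(detected: Set[int], actual: Set[int]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for detected weak points E against true weak points A."""
    hits = len(detected & actual)  # |A ∩ E|
    precision = hits / len(detected) if detected else 0.0
    recall = hits / len(actual) if actual else 0.0
    # F1 is the harmonic mean of precision and recall; defined as 0 when there are no hits.
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


E = {1, 2, 3, 4}           # hypothetical ids of points flagged as weak
A = {2, 3, 4, 5, 6, 7}     # hypothetical ids of true weak points

precision, recall, f1 = precision_recall_f1(E, A)
print(precision, recall, f1)
```

Here three of the four flagged points are truly weak, and three of the six true weak points are found.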

Baselines. We compare DeepRobust-W and DeepRobust-B with two baselines. One naive baseline (denoted *random*) is randomly selecting the same number of points as detected by our proposed method to be weak points. Another baseline (denoted *top1*) is based on prediction confidence score—if the confidence of a data point is higher than a pre-defined cutoff we call it a strong point, weak otherwise. This baseline is based on the intuition that DNNs might not be confident enough to predict the weak points.
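The two baselines can be sketched as follows. This is only an illustration: the function names, point identifiers, and confidence values are hypothetical, and the confidence cutoff is left as a free parameter because the text does not fix its value here.

```python
import random
from typing import Dict, List


def top1_weak_points(confidence: Dict[str, float], cutoff: float) -> List[str]:
    """top1 baseline: a point is strong if its confidence is higher than the cutoff, weak otherwise."""
    return [pid for pid, c in confidence.items() if c <= cutoff]


def random_weak_points(point_ids: List[str], n: int, rng: random.Random) -> List[str]:
    """random baseline: pick n distinct points uniformly at random as weak points."""
    return rng.sample(point_ids, n)


# Hypothetical top-1 softmax confidences for three test points.
conf = {"a": 0.99, "b": 0.60, "c": 0.80}
top1_weak = top1_weak_points(conf, cutoff=0.80)

# The random baseline selects as many points as the method under comparison detected.
random_weak = random_weak_points(list(conf), n=len(top1_weak), rng=random.Random(0))
print(top1_weak, random_weak)
```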

# 5 Results

In this section, we elaborate on our results. In our preliminary experiments, we have two findings regarding neighbor accuracy. First, the neighbor accuracy varies widely across data points, and a non-trivial number of points have relatively low neighbor accuracy. For example, for all the models trained on the CIFAR-10 dataset, 40% of training data and 42% of testing data have neighbor accuracy <0.75, and 16% of training data and 20% of testing data have neighbor accuracy <0.50. These points degrade the aggregated spatial robustness of the model. The same finding holds for the other two datasets. Second, the distribution of neighbor accuracy for a dataset is similar across different models. For CIFAR-10, F-MNIST and SVHN, 60%, 76%, and 81%, respectively, of data points have a neighbor accuracy change < 0.2 across any two models on the same dataset. This implies that, for a large portion of data points, the neighbor accuracy is largely independent of the model selected.

The first observation shows that neighbor accuracy is a distinguishable measure for local robustness for the datasets and models we study. The second observation implies that the properties of points of low neighbor accuracy may be similar across models for each dataset. Following these two observations, we dive deeper and explore the characteristics of data points with different neighbor accuracy in RQ1. We then evaluate the performance of DeepRobust-W and DeepRobust-B which are developed based on the observations from RQ1 in RQ2 and RQ3, respectively. Finally, in RQ4, we evaluate the generalizability of our method by applying DeepRobust-W in a regression task for self-driving cars under more complex transformations.

### RQ1. What are the characteristics of the weak points?

We explore the characteristics of robust vs. non-robust points in their feature space. In particular, we check the difference in feature representations between: a) robust and non-robust points, and b) points with different degrees of robustness.

RQ1a. Given a well trained model, do the feature representations of robust and non-robust points vary? In this RQ, we first explore how robust (i.e., strong) and non-robust (i.e., weak) data points are distributed in the feature space.

We apply t-SNE [44], a widely used visualization method, to visualize the distribution of points of different neighbor accuracy in the representation space for all three datasets when using ResN as the classifier. Figure 5 shows the visualization of feature vectors from two randomly picked classes, with colors indicating the neighbor accuracy of each point. The darker a point's color is, the lower its neighbor accuracy is. It is evident that most points of low neighbor accuracy tend to be further away from the class center.

To numerically verify this observation, we first define a class center $c_k$ for each class $k$ as the median value of the feature vectors of all the points from class $k$. Thus, if $f_i$ is the feature of a point at the $i$-th dimension and $\hat{f}_{ik}$ is the median of the $i$-th dimension features for all the points in class $k$, then $c_k$ is defined to be $(\hat{f}_{1k}, \ldots, \hat{f}_{jk}, \ldots, \hat{f}_{nk})$.

Fig. 5: The t-SNE plots of data points from two randomly chosen classes across three datasets using ResNet: (a) CIFAR-10, (b) F-MNIST, (c) SVHN. Darker color indicates lower neighbor accuracy.

The reason we take the median rather than the mean is that it is a more statistically stable measure and is less likely to be heavily influenced by outliers in the representation space. Then, for every point $p$, we define a ratio $r(p) = \frac{d(p)_{\text{same\_class}}}{d(p)_{\text{nearest\_other\_class}}}$, where $d(p)_{\text{same\_class}}$ is the distance of the $p$-th point's feature vector to its own class center and $d(p)_{\text{nearest\_other\_class}}$ is the distance of the $p$-th point's feature vector to the class center of its closest other class. A small $r(p)$ means that the point $p$ is close to its own class center while far from other classes, i.e., $p$ is far from the decision boundary. In contrast, a larger $r(p)$ indicates that the point $p$ is closer to some other classes, i.e., it is closer to the decision boundary.

We then measure the average $r(p)$ among the weak points (denoted as $r_w$) and among the strong points (denoted as $r_s$) for all three datasets across three models. Besides, we also apply the Mann-Whitney-Wilcoxon test [47] and compute Cohen's d effect size [10] between the two ratios to test whether the two ratios indeed have a statistically significant difference and how large the difference is.

Table 2: Weak and strong points ratio, and Cohen's d effect size

0.80 = large, 1.20 = very large, and 2.0 = huge [10, 59].
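The class centers and the ratio r(p) described above can be sketched in a few lines. This is a hedged illustration rather than the authors' code: the function names and the two-dimensional toy features are hypothetical, and Euclidean distance is an assumption, since the text does not name the distance used.

```python
import math
from statistics import median
from typing import Dict, List, Sequence

Vector = Sequence[float]


def class_centers(features: Dict[int, List[Vector]]) -> Dict[int, List[float]]:
    """Class center: the coordinate-wise median of all feature vectors of the class."""
    return {k: [median(dim) for dim in zip(*vecs)] for k, vecs in features.items()}


def ratio(point: Vector, label: int, centers: Dict[int, List[float]]) -> float:
    """r(p): distance to the own class center over distance to the nearest other center."""
    d_same = math.dist(point, centers[label])  # Euclidean distance (assumed)
    d_other = min(math.dist(point, c) for k, c in centers.items() if k != label)
    return d_same / d_other


# Hypothetical 2-D feature vectors for two classes.
features = {
    0: [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)],
    1: [(4.0, 4.0), (5.0, 4.0), (4.0, 5.0)],
}
centers = class_centers(features)

r_inner = ratio((1.0, 0.0), 0, centers)     # near its own center: small ratio
r_boundary = ratio((3.0, 3.0), 0, centers)  # nearer to the other class: large ratio
print(centers, r_inner, r_boundary)
```

A point of class 0 that sits next to the center of class 1 gets a ratio well above 1, which matches the reading that larger r(p) means closer to the decision boundary.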

As shown in Table 2, for both neighbor accuracy cutoffs (0.5 and 0.75), Cohen's d effect size is larger than 0.50 for every setting except one, which implies a medium to very large difference. Besides, for every setting, the p-value of the Mann-Whitney-Wilcoxon test (not shown in the table) is smaller than $10^{-80}$, which implies that the difference is indeed statistically significant.
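For readers unfamiliar with the effect size, the sketch below computes Cohen's d with a pooled standard deviation. The text does not state which variant of Cohen's d was used, so the pooled form is an assumption, and the ratio values are made up for illustration.

```python
from statistics import mean, variance
from typing import Sequence


def cohens_d(a: Sequence[float], b: Sequence[float]) -> float:
    """Cohen's d: difference of means divided by the pooled standard deviation."""
    na, nb = len(a), len(b)
    # statistics.variance is the sample variance (divides by n - 1).
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / pooled_var ** 0.5


# Hypothetical r(p) values for weak and strong points.
r_weak = [0.9, 1.0, 1.1, 1.2]
r_strong = [0.4, 0.5, 0.6, 0.7]

d = cohens_d(r_weak, r_strong)
print(round(d, 3))
```

A value this far above 2.0 would count as "huge" on the scale quoted with Table 2; the toy groups are deliberately well separated.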

The visualization and numerical results imply that most weak points are close to the decision boundaries between classes. Note that a similar observation was made by Kim et al. [36] in the case of adversarial perturbation. In particular, they find that adversarial examples tend to be closer to class decision boundaries. In contrast, we focus on spatial robustness and find that spatially non-robust points are closer to decision boundaries.

RQ1b. Given a well trained model, do the feature representations of the data points vary by their degree of robustness? By analyzing the classifications of the neighbors of weak vs. strong points, we observe that the weaker a point is, the more likely its neighbors are to be classified into different classes. We quantify this observation by computing the diversity of the outputs of a point's neighbors. We adopt the Simpson Diversity Index (λ) [67] as defined in Equation (1).

Table 3 shows the Spearman correlation between neighbor accuracy and λ on the three datasets and three models for each. Note that while calculating the correlation, we remove points with neighbor accuracy 100%, since there are many such points and they tend to bias the Spearman correlation upward; if we include points with neighbor accuracy 100%, the correlations become even higher. We notice that for any setting, the Spearman correlation is never lower than 0.853. This indicates that neighbor accuracy and diversity are highly correlated with each other. For example, the bird image in Fig. 1a has neighbor accuracy 0.49 and diversity 0.36, while the bird image in Fig. 1e has neighbor accuracy 1 and diversity 1. This shows that the classifier tends to be confused about weak points and mispredicts them into many different classes.
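To make the diversity score concrete, the sketch below uses the standard form of Simpson's index, the sum of squared class proportions among the predictions for a point's neighbors. Equation (1) is not reproduced in this section, so treating it as this standard form is an assumption; it is at least consistent with a point of neighbor accuracy 1 having a score of 1. The function name and labels are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def simpson_index(predictions: Sequence[Hashable]) -> float:
    """Simpson's index: sum over predicted classes of (class share) squared.

    Equals 1.0 when every neighbor receives the same class and decreases
    as the predictions spread over more classes.
    """
    n = len(predictions)
    if n == 0:
        raise ValueError("at least one prediction is required")
    return sum((count / n) ** 2 for count in Counter(predictions).values())


# All ten neighbors predicted as the same class: no diversity, score 1.
score_strong = simpson_index(["bird"] * 10)

# Predictions split 5/3/2 over three classes: 0.25 + 0.09 + 0.04.
score_weak = simpson_index(["bird"] * 5 + ["plane"] * 3 + ["cat"] * 2)
print(score_strong, score_weak)
```

Under this form a lower score means more diverse predictions, which is why the score is described as inversely correlated with neighbor diversity.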

Result 1: *In the representation space, weak points tend to lie towards the class decision boundary while the strong points lie towards the center. The weaker an image is, the model tends to be more confused by it, and classify its neighbors into more diverse classes.*

#### RQ2. Can we detect the weak points in a white-box setting?

We explore this RQ using DeepRobust-W, as discussed in Section 3.3. DeepRobust-W takes the feature vector of a data point as input and classifies it to a strong/weak point. We implement DeepRobust-W with a simple 4-layer, fully connected neural network architecture with hidden layer dimensions 1500, 1000, and 500, respectively.

Table 4 shows the result. At 0.75 setting, DeepRobust-W has F1 up to 91.4%, with an average of 76.9%. At 0.50 setting, DeepRobust-W detects weak points with average F1 of 61.1%, while it can go up to 79.1%. DeepRobust-W consistently performs significantly better than the baseline methods.

The top1 baseline has very good precision, since a misclassified image with low confidence tends to have very poor local robustness. However, there also exist many images that are correctly classified with high confidence yet have poor local robustness. Missing these points leads the top1 baseline to have very poor recall and thus an even worse F1 than the random baseline. Our method addresses this by providing high recall together with decent precision.

Table 4: Performance of DeepRobust-W and the baseline methods for predicting weak points.


Notice that DeepRobust-W's performance depends on the training data selection, mainly (a) how many weak vs. strong points are used to train the model, and (b) how many neighbors are generated per point to decide if it is strong/weak. To investigate (a), we assign a weight to each input point, indicating how likely it is to be selected to train DeepRobust-W. In particular, for an input $i$, a weight $w_i := \frac{1 + (1 - n_i)^m \times 100^m}{1 + 100^m}$ is computed, where $n_i$ is its neighbor accuracy and $m$ is a configurable parameter; with larger $m$, more weak points are sampled and DeepRobust-W will be trained with more weak points, and vice versa.
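The effect of the parameter m on the sampling weight can be checked numerically with the short sketch below. The function name is hypothetical, and the formula is transcribed from the weight definition above as reconstructed from the text.

```python
def sampling_weight(n_i: float, m: float) -> float:
    """Weight w_i = (1 + (1 - n_i)^m * 100^m) / (1 + 100^m) for neighbor accuracy n_i."""
    return (1 + (1 - n_i) ** m * 100 ** m) / (1 + 100 ** m)


# The weakest possible point (neighbor accuracy 0) always gets weight 1.
w_weakest = sampling_weight(0.0, m=1)

# A fully robust point (neighbor accuracy 1) gets a very small weight.
w_strongest = sampling_weight(1.0, m=1)

# A mid-range point loses weight as m grows, so sampling concentrates on weak points.
w_mid_m1 = sampling_weight(0.5, m=1)
w_mid_m2 = sampling_weight(0.5, m=2)
print(w_weakest, w_strongest, w_mid_m1, w_mid_m2)
```

These weights could then be passed to a weighted sampler such as `random.choices(points, weights=...)` to draw the training subset.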

Table 5A shows the performance: as m increases, the detector trades precision for recall. In this way, by choosing different values of m, the precision-recall trade-off of the detector can be adjusted according to a user's need. From a different perspective, this way of oversampling weak points also addresses the potential problem of imbalanced data when the weak points are far fewer than the strong points.

Table 5: DeepRobust-W performance using different sampling strategies for training


Next, we check how DeepRobust-W's performance depends on the number of sampled neighbors, because a data point can potentially have infinitely many neighbors. Table 5B shows that the number of neighbors does not have much influence on the performance of the detector once it goes beyond some value (the F1 score changes by less than 3.5 percentage points between 25 and 200 samples) for all three datasets. Thus, we choose 50 for all of our experiments. For future work, a statistical bound with confidence intervals for neighbor accuracy can be estimated by modeling neighbor accuracy using distributions like the folded normal.

Result 2: DeepRobust-W *can identify weak points with reasonably high F1 score: on average* 76.9%*, at* 0.75 *neighbor accuracy cut-off.*

#### RQ3. Can we identify the weak points in a black-box setting?

We explore this RQ using DeepRobust-B, as discussed in Section 3.3. We assume only having access to unlabeled testing data and the model under test as a black-box. To evaluate DeepRobust-B, we spatially transform each test input m times by randomly applying dω ∈ [−30, +30] degrees rotation, dx ∈ [−3, +3] pixels horizontal translation, and dy ∈ [−3, +3] pixels vertical translation. We then calculate the output diversity score (λ) based on Equation (1) and rank the test images based on λ. Finally, we mark the top k images as potentially the least robust points. The parameter k is chosen according to the user's need.
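The sampling and ranking steps can be sketched as below. This is an illustration under assumptions, not the authors' implementation: the function names and scores are hypothetical, translations are assumed to be whole pixels, and the images with the lowest diversity score are assumed to be the ones ranked as least robust, since a higher score is associated with strong points.

```python
import random
from typing import Dict, List, Tuple


def sample_transformations(m: int, rng: random.Random) -> List[Tuple[float, int, int]]:
    """Draw m random (rotation in degrees, dx in pixels, dy in pixels) triples."""
    return [
        (rng.uniform(-30.0, 30.0), rng.randint(-3, 3), rng.randint(-3, 3))
        for _ in range(m)
    ]


def weakest_k(diversity: Dict[str, float], k: int) -> List[str]:
    """Ids of the k inputs with the lowest diversity score (assumed weakest first)."""
    return sorted(diversity, key=diversity.get)[:k]


params = sample_transformations(15, random.Random(0))

# Hypothetical diversity scores for three test images.
scores = {"img_a": 0.90, "img_b": 0.30, "img_c": 0.50}
flagged = weakest_k(scores, k=2)
print(len(params), flagged)
```

Applying each sampled triple to an image and querying the model yields the neighbor predictions from which λ is computed.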

Fig. 6: The spearman correlation coeff. between diversity score (λ) and neighbor accuracy, with varying #neighbors (m).

For each test data point, DeepRobust-B queries the model with m neighbors to compute λ. Since querying the classifier comes with an overhead, our goal is to achieve an optimal accuracy with minimal queries (i.e., m). To determine an optimal m value, we explore the Spearman correlation between diversity score and neighbor accuracy, with varying m, when running ResN on all three datasets (see Figure 6). The correlation increases as m increases because, with more queries, λ is estimated more accurately, and so is the neighbor accuracy. We notice that at m = 15, the correlation coefficients across all the experimental settings reach above 0.8, and the rate of increase begins to slow down significantly. The results for the other two architectures are highly similar. Thus, we set m = 15 as default for DeepRobust-B.

Fig. 7: AUC-ROC curve with neighbor accuracy cutoff at 0.75. The red vertical line indicates when the diversity score threshold is chosen from training data.

Next, we evaluate DeepRobust-B's performance. We plot AUC-ROC by varying the top-k cutoff at m = 15 and compare our method with the random baseline and the top1 baseline as before. As shown in Figure 7, our method performs much better than the random baseline. In particular, our proposed method achieves AUC higher than 0.87 for all settings when the neighbor accuracy cutoff is 0.5 and higher than 0.97 when the neighbor accuracy cutoff is 0.75.

Instead of the above ranking-based scheme, DeepRobust-B can also be used as a classifier if a diversity threshold is given (see Section 3.3). Here, we estimate the threshold using pre-annotated training data.



We evaluate the precision and recall of DeepRobust-B in the nine DNN-dataset combinations under neighbor accuracy cutoffs 0.5 and 0.75. Table 6 shows the result. At the 0.75 setting, DeepRobust-B has F1 up to 99.1%, with an average of 96.5%. At the 0.50 setting, DeepRobust-B detects weak points with an average F1 of 72.9%, while it can go up to 85.7%. It consistently produces better estimation than the top1 baseline and the random baseline. This shows that our black-box method can effectively identify weak points.

Note that generating the spatial transformations and querying the model with them under the black-box setting is fast. Previous black-box methods for adversarial perturbation work in such a fashion [26,51]. For example, using CIFAR-10, when we use a batch of size 100, the average transformation+query time for one image is 0.031 ± 0.015 ms. For the other two datasets, the overhead is similar. Thus, for m = 15 queries, it takes only 0.465 ± 0.225 ms, which is a negligible overhead for most real-world DNN-based vision applications. This implies that our black-box method can also be used in real time for many applications.
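The per-input overhead quoted above follows from multiplying the per-query figures by the number of queries, as the short check below shows. The variable names are hypothetical. Scaling the spread linearly by m reproduces the reported value; this is an observation about the arithmetic rather than a statement about how the authors derived it.

```python
# Reported per-image transformation + query time on CIFAR-10 (batch size 100).
per_query_ms = 0.031
per_query_spread_ms = 0.015

# Default number of neighbors queried per test input by DeepRobust-B.
m = 15

total_ms = m * per_query_ms
total_spread_ms = m * per_query_spread_ms
print(f"{total_ms:.3f} ± {total_spread_ms:.3f} ms")
```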

Result 3: *Given only black-box access to the DNN classifier,* DeepRobust-B *can identify weak points with F1 scores that are much better than those of the top1 method or the random method.*

#### RQ4. How generalizable are these findings?

The local robustness issues also exist in more critical applications like self-driving cars. Here we explore more complex transformations, i.e., adding rain and fog to the driving scenes. As shown in Figure 8, among the correctly classified data points, a non-trivial portion (45.8%) suffers from low (<0.75) neighbor accuracy (in the heatmap, a redder color signifies a weaker point).

Note that, here, we test regression models, which take images of driving scenes as inputs and output the corresponding steering angles.

Let a set of outputs predicted by a DNN be denoted by $\{\hat{\theta}_{o1}, \hat{\theta}_{o2}, \ldots, \hat{\theta}_{on}\}$, and the ground truth labels for the original (unmodified) image points be $\{\theta_1, \theta_2, \ldots, \theta_n\}$. If the difference between the predicted steering angle $\hat{\theta}_{oi}$ of a transformed image and the ground truth label $\theta_i$ of the original image is above a threshold, we consider it as incorrect.

The threshold $\lambda\, MSE_{orig}$ is defined following DeepTest [73], with $MSE_{orig} = \frac{1}{n} \sum_{i=1}^{n} (\theta_i - \hat{\theta}_{oi})^2$. MSE is the Mean Square Error between the outputs and the manual labels, and λ is a positive coefficient that is chosen to reflect a user's tolerance on the deviation. Note that there is no softmax layer (and thus no confidence score) in these regression models, so the top1 baseline method cannot be used here.
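The correctness test for the regression setting can be sketched as follows. The function names and steering angles are hypothetical, and the sketch assumes that the squared deviation of a prediction from the original label is compared against λ times the MSE on the original images, which is how the DeepTest-style threshold is read here.

```python
from typing import Sequence


def mse(labels: Sequence[float], preds: Sequence[float]) -> float:
    """Mean squared error between ground-truth labels and predictions."""
    return sum((t - p) ** 2 for t, p in zip(labels, preds)) / len(labels)


def is_correct(pred_transformed: float, label: float, lam: float, mse_orig: float) -> bool:
    """Correct if the squared deviation stays within lam * mse_orig (assumed form)."""
    return (pred_transformed - label) ** 2 <= lam * mse_orig


# Hypothetical steering angles: labels and predictions on the original images.
labels = [0.0, 0.1, -0.2, 0.3]
preds_orig = [0.1, 0.0, -0.1, 0.2]
mse_orig = mse(labels, preds_orig)  # every prediction is off by 0.1

# Predictions on two transformed versions of the image whose label is 0.1.
small_drift = is_correct(0.25, 0.1, lam=3, mse_orig=mse_orig)
large_drift = is_correct(0.40, 0.1, lam=3, mse_orig=mse_orig)
print(mse_orig, small_drift, large_drift)
```

Larger values of λ tolerate larger deviations, so fewer transformed images count as incorrect.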

Table 7 shows the result when λ = 3. At the 0.75 setting, DeepRobust-W has an F1 score up to 78.9%, with an average of 58.2%. At the 0.50 setting, DeepRobust-W detects weak points with an average F1 of 47.9%, while it can go up to 68.2%. It consistently produces better estimation than the random baseline under all the settings. It should be noted that our observation is valid for all the values of λ used in [73], from 1 to 5. This shows that our proposed method DeepRobust-W can be applied to regression problems with more complex natural transformations.


Table 7: Performance of DeepRobust-W for predicting weak points of Self-Driving dataset

Fig. 8: The t-SNE plot of correctly classified data points from Self-Driving dataset by the Epoch model. Data points are colored based on neighbor accuracy.

It should also be noted that it is unrealistic to use DeepRobust-B for this task, for two reasons. First, it is impractical to try different variations of an image in real time for a self-driving car, which is a time-sensitive application. Second, DeepRobust-B requires the calculation of the neighbor diversity score. For a regression problem, the predicted values are continuous, so there is a very low probability of any two predictions being equal. Thus, the neighbor diversity score for every data point will be the same and cannot be used for identifying the weak points.

Result 4: DeepRobust-W *can detect weak points of a self-driving car dataset with F1 score up to 78.9%, with an average of 58.2%, at neighbor accuracy cutoff 0.75.*

# 6 Related Work

Adversarial examples. Many works focus on generating adversarial examples to fool the DNNs and evaluate their robustness using pixel-based perturbation [9, 17, 23, 25, 31, 36, 48, 49, 54, 63, 80–83]. Some other papers [14, 15, 86], like ours, proposed more realistic transformations to generate adversarial examples. In particular, Engstrom et al. [14] proposed that a simple rotation and translation can fool a DNN-based classifier, and that spatial adversarial robustness is orthogonal to $\ell_p$-bounded adversarial robustness. However, all these works estimate the overall robustness of a DNN based on its aggregated behavior across many data points. In contrast, we analyze the robustness of individual data points under natural variations and propose methods to detect weak/strong points automatically.

DNN testing. Many researchers [16, 21, 29, 36, 41, 55, 69, 70, 74, 94] proposed techniques to test DNNs. For example, Pei et al. [55] proposed an image transformation based differential testing framework, which can detect erroneous behavior by comparing the outputs of an input image across multiple DNNs. Ferit et al. [16] used fault localization methods to identify suspicious neurons and leveraged those to generate adversarial test cases.

In contrast, others [8, 29, 64, 73, 78, 92, 94] used metamorphic testing, where the assumption is that the outputs of an original image and its transformed version will be the same under natural transformations. Among them, some use an uncertainty measure to quantify some types of non-robustness of an input for prioritizing samples for testing / retraining [8] or generating test cases [78]. We follow a similar metamorphic property while estimating neighbor accuracy, and our proposed DeepRobust-B also leverages an uncertainty measure. The key differences are as follows. First, we focus on estimating a model's performance on general natural variants of an input rather than the input itself or only spatial variants. Second, we focus on the task of weak points detection rather than prioritizing / generating test cases. We also give detailed analyses of the properties of natural variants and propose a feature vector based white-box detection method, DeepRobust-W. Further, we show that our method works across domains (both image classification and self-driving car controllers) and tasks (both classification and regression). Other uncertainty works complement ours in the sense that we can easily leverage weak points identified by DeepRobust-W and DeepRobust-B to prioritize test cases or generate more adversarial cases of natural variants.

Another line of work [18, 19, 27, 33, 34, 58, 72] estimates the confidence of a DNN's output. For example, [19] leverages thrown away information from existing models to measure confidence; [27] shows other NN properties like depth, width, weight decay, and batch normalization are important factors influencing prediction confidence. Although such methods can provide a confidence measure per input or its adversarial variants, they do not check its natural robustness property, i.e., with natural variations how will they behave.

DNN verification. There also exists work on verifying properties for a DNN model [7, 12, 24, 30, 56, 62, 83]. Most of them focus on verifying properties on an $\ell_p$-norm bounded input space. Recently, Balunovic et al. [4] provided the first verification technique for verifying a data point's robustness against spatial transformation. However, their technique suffers from scalability issues.

Robust training. Regular neural network training involves the optimization of the loss for each data point. Robust training of neural network works on minimizing the largest loss within a bounded region usually using adversarial examples [15,35,43,45,50,75,81,83,84]. While both robust training methods and our work generate variants of data points, instead of training a model with these variants to improve robustness, we use them to estimate the robustness of unseen data points. The relation between robust retraining and our work is similar to bug fixing vs. bug detection in traditional software engineering literature.

#### 7 Threats to Validity

We adopt rotation and translation as transformations for image classification tasks and rain and fog effects for the self-driving car task. There are many more natural variations, such as brightness and snow effects. However, rotation and translation are representative of spatial transformation and are used by many papers in evaluating the robustness of DNN models [14, 55]. Rain and fog effects are also widely leveraged in many influential studies on testing self-driving cars [55,73,92].

Besides, for some of the experiments we did not show all the combinations under both neighbor accuracy cutoffs (i.e. 0.5 and 0.75). However, we note that the observations are consistent and we did not include them purely because of space limitation. Another limitation is that for both DeepRobust-W and DeepRobust-B, we need to decide the number of neighbors to use for training a classifier and estimating λ, respectively. We mitigate it by selecting the neighbor numbers that give stable performance in terms of precision and recall.

#### 8 Conclusion and Future Work

In this work, we incorporate data characteristics into the robustness testing of DNN models. We adopt the concept of neighbor accuracy as a measure for local robustness of a data point on a given model. We explore the properties of neighbor accuracy and find that weak points are often located towards the corresponding class boundaries and their transformed versions tend to be predicted as more diverse classes. Leveraging these observations, we propose a white-box method and a black-box method to identify weak/strong points to warn a user about potential weakness in the given trained model in real time. We design, implement and evaluate our proposed framework, DeepRobust-W and DeepRobust-B, on three image recognition datasets and one self-driving car dataset (for DeepRobust-W only) with three models for each. The results show that they can effectively identify weak/strong points with high precision and recall.

For future work, other consistency analysis methods [18], e.g., variation ratio and entropy, can be tried. We can potentially attain a statistical guarantee for our black-box method by modeling the neighbor accuracy distribution and assuming a certain level of correlation between neighbor accuracy and complexity score. Besides, other definitions of robustness like consistency can be explored. We can also leverage ideas from [8,78] to prioritize test cases or generate harder test cases based on identified weak points. Further, we can potentially modify existing fixing methods such as [20] to target the weak points and fix them.

#### 9 Acknowledgement

We thank Mukul Prasad and Ripon Saha from Fujitsu US for valuable discussions. This work is supported in part by NSF CCF-1845893 and CCF-1822965.

#### References


on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. pp. 510–520. ESEC/FSE 2019, Association for Computing Machinery, New York, NY, USA (2019). https://doi.org/10.1145/3338906.3338955



# **Test-Comp Contributions**

# Status Report on Software Testing: Test-Comp 2021

Dirk Beyer

LMU Munich, Munich, Germany

Abstract. This report describes Test-Comp 2021, the 3rd edition of the Competition on Software Testing. The competition is a series of annual comparative evaluations of fully automatic software test generators for C programs. The competition has a strong focus on reproducibility of its results and its main goal is to provide an overview of the current state of the art in the area of automatic test-generation. The competition was based on 3 173 test-generation tasks for C programs. Each test-generation task consisted of a program and a test specification (error coverage, branch coverage). Test-Comp 2021 had 11 participating test generators from 6 countries.

Keywords: Software Testing · Test-Case Generation · Competition · Program Analysis · Software Validation · Software Bugs · Test Validation · Test-Comp · Benchmarking · Test Coverage · Bug Finding · Test-Suites · BenchExec · TestCov

# 1 Introduction

Among several other objectives, the Competition on Software Testing (Test-Comp [4, 5, 6], https://test-comp.sosy-lab.org/2021) showcases every year the state of the art in the area of automatic software testing. This edition of Test-Comp is the 3rd edition of the competition. It provides an overview of the currently achieved results by tool implementations that are based on the most recent ideas, concepts, and algorithms for fully automatic test generation. This competition report describes the (updated) rules and definitions, presents the competition results, and discusses some interesting facts about the execution of the competition experiments. The setup of Test-Comp is similar to SV-COMP [8], in terms of both technical and procedural organization. The results are collected via BenchExec's XML results format [16], and transformed into tables and plots in several formats (https://test-comp.sosy-lab.org/2021/results/). All results are available in artifacts at Zenodo (Table 3).

This report extends previous reports on Test-Comp [4, 5, 6].

Reproduction packages are available on Zenodo (see Table 3).

Funded in part by the Deutsche Forschungsgemeinschaft (DFG) – 418257054 (Coop).

dirk.beyer@sosy-lab.org

© The Author(s) 2021

E. Guerra and M. Stoelinga (Eds.): FASE 2021, LNCS 12649, pp. 341–357, 2021.

https://doi.org/10.1007/978-3-030-71500-7\_17

Competition Goals. In summary, the goals of Test-Comp are the following [5]:


Related Competitions. In the field of formal methods, competitions are respected as an important evaluation method and there are many competitions [2]. We refer to the previous report [5] for a more detailed discussion and give here only the references to the most related competitions [2, 8, 32, 39].

Quick Summary of Changes. As the competition continuously improves, we report the changes since the last report. We list a summary of five new items in Test-Comp 2021 as overview:


# 2 Definitions, Formats, and Rules

Organizational aspects such as the classification (automatic, off-site, reproducible, jury, training) and the competition schedule are given in the initial competition definition [4]. In the following, we repeat some important definitions that are necessary to understand the results.

Fig. 1: Flow of the Test-Comp execution for one test generator (taken from [5])

Test-Generation Task. A *test-generation task* is a pair of an input program (program under test) and a test specification. A *test-generation run* is a noninteractive execution of a test generator on a single test-generation task, in order to generate a test suite according to the test specification. A *test suite* is a sequence of test cases, given as a directory of files according to the format for exchangeable test-suites.<sup>1</sup>

Execution of a Test Generator. Figure 1 illustrates the process of executing one test generator on the benchmark suite. One test run for a test generator gets as input (i) a program from the benchmark suite and (ii) a test specification (cover bug, or cover branches), and returns as output a test suite (i.e., a set of test cases). The test generator is contributed by a competition participant as a software archive in ZIP format. The test runs are executed centrally by the competition organizer. The test-suite validator takes as input the test suite from the test generator and validates it by executing the program on all test cases: for bug finding it checks if the bug is exposed and for coverage it reports the coverage. We use the tool TestCov [15] <sup>2</sup> as test-suite validator.

Test Specification. The specification for testing a program is given to the test generator as input file (either properties/coverage-error-call.prp or properties/coverage-branches.prp for Test-Comp 2021).

The definition init(main()) is used to define the initial states of the program under test by a call of function main (with no parameters). The definition FQL(f) specifies that coverage definition f should be achieved. The FQL (FShell query language [28]) coverage definition COVER EDGES(@DECISIONEDGE) means that all branches should be covered (typically used to obtain a standard test suite for quality assurance) and COVER EDGES(@CALL(foo)) means that a call (at least one) to function foo should be covered (typically used for bug finding). A complete specification looks as follows: COVER( init(main()), FQL(COVER EDGES(@DECISIONEDGE)) ).
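To illustrate the structure of such a specification, the following minimal Python sketch extracts the coverage goal from a specification string. The function name `parse_spec` and the returned labels are hypothetical; the sketch assumes only the two FQL patterns quoted above.

```python
import re

# Matches COVER( init(main()), FQL(COVER EDGES(<goal>)) ), tolerating extra whitespace.
SPEC_PATTERN = re.compile(
    r"COVER\(\s*init\(main\(\)\)\s*,\s*"
    r"FQL\(COVER\s+EDGES\((?P<goal>@\w+(?:\(\w+\))?)\)\)\s*\)"
)

def parse_spec(text):
    """Return ('branches', None) or ('call', function_name) for a specification."""
    match = SPEC_PATTERN.search(text)
    if match is None:
        raise ValueError("unsupported test specification")
    goal = match.group("goal")
    if goal == "@DECISIONEDGE":
        return ("branches", None)
    call = re.fullmatch(r"@CALL\((\w+)\)", goal)
    if call:
        return ("call", call.group(1))
    raise ValueError(f"unsupported coverage goal: {goal}")
```

For the complete specification shown above, `parse_spec` would report the branch-coverage goal; for a specification with `@CALL(foo)`, it would report a call goal for function `foo`.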

<sup>1</sup> https://gitlab.com/sosy-lab/software/test-format/

<sup>2</sup> https://gitlab.com/sosy-lab/software/test-suite-validator

Table 1: Coverage specifications used in Test-Comp 2021 (similar to 2019, 2020)


```
format_version: '2.0'

# old file name: floppy_true-unreach-call_true-valid-memsafety.i.cil.c
input_files: 'floppy.i.cil-3.c'

properties:
  - property_file: ../properties/unreach-call.prp
    expected_verdict: true
  - property_file: ../properties/valid-memsafety.prp
    expected_verdict: false
    subproperty: valid-memtrack
  - property_file: ../properties/coverage-branches.prp

options:
  language: C
  data_model: ILP32
```
Fig. 2: Example task definition file floppy.i.cil-3.yml for C program floppy.i.cil-3.c (format version and options are new compared to last year)

Table 1 lists the two FQL formulas that are used in test specifications of Test-Comp 2021; there was no change from 2020 (except that special function \_\_VERIFIER\_error does not exist anymore).

Task-Definition Format 2.0. The format for the task definitions in the SV-Benchmarks repository was extended by options that can carry information from the test-generation task to the test tool. Test-Comp 2021 used the format in version 2.0 (https://gitlab.com/sosy-lab/benchmarking/task-definition-format/-/tree/2.0). The options now contain the language (C or Java) and the data model (ILP32, LP64, see http://www.unix.org/whitepapers/64bit.html, only for C programs) that the program of the test-generation task assumes (https://github.com/sosy-lab/sv-benchmarks#task-definitions). An example task definition is provided in Fig. 2: This YAML file specifies, for the C program floppy.i.cil-3.c, two verification tasks (reachability of a function call and memory safety) and one test-generation task (coverage of all branches). Previously, the options for language and data model were defined in category-specific configuration files (for example c/ReachSafety-ControlFlow.cfg), which were deleted before Test-Comp 2021.
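As a rough illustration of how a tool might consume such a task definition, the sketch below selects the test-generation tasks from the content of Fig. 2. It assumes that the YAML file has already been loaded into a dictionary by some YAML parser; the helper name `coverage_tasks` and the rule "property file name starts with `coverage-`" are assumptions made for this example.

```python
# Task definition from Fig. 2, written as the dictionary a YAML parser would produce.
task_definition = {
    "format_version": "2.0",
    "input_files": "floppy.i.cil-3.c",
    "properties": [
        {"property_file": "../properties/unreach-call.prp", "expected_verdict": True},
        {"property_file": "../properties/valid-memsafety.prp",
         "expected_verdict": False, "subproperty": "valid-memtrack"},
        {"property_file": "../properties/coverage-branches.prp"},
    ],
    "options": {"language": "C", "data_model": "ILP32"},
}

def coverage_tasks(definition):
    """List (program, property file, data model) for each coverage property."""
    options = definition.get("options", {})
    tasks = []
    for prop in definition.get("properties", []):
        name = prop["property_file"].rsplit("/", 1)[-1]
        # Assumption: test specifications are recognizable by their file name.
        if name.startswith("coverage-"):
            tasks.append((definition["input_files"], name, options.get("data_model")))
    return tasks
```

Applied to the dictionary above, the helper yields a single test-generation task for `coverage-branches.prp` with data model ILP32, while the two verification properties are skipped.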

License and Qualification. The license of each participating test generator must allow its free use for reproduction of the competition results. Details on qualification criteria can be found in the competition report of Test-Comp 2019 [6]. Furthermore, the community tries to apply the SPDX standard (https://spdx.dev) to the SV-Benchmarks repository. Continuous-integration checks based on REUSE (https://reuse.software) will ensure that all benchmark tasks adhere to the standard.

# 3 Categories and Scoring Schema

Benchmark Programs. The input programs were taken from the largest and most diverse open-source repository of software-verification and test-generation tasks <sup>3</sup>, which is also used by SV-COMP [8]. As in 2020, we selected all programs for which the following properties were satisfied (see issue on GitHub <sup>4</sup> and report [6]):


This selection yielded a total of 3 173 test-generation tasks, namely 607 tasks for category *Error Coverage* and 2 566 tasks for category *Code Coverage*. The test-generation tasks are partitioned into categories, which are listed in Tables 6 and 7 and described in detail on the competition web site.<sup>6</sup> Figure 3 illustrates the category composition.

The programs in the benchmark collection contained functions \_\_VERIFIER\_error and \_\_VERIFIER\_assume that had a specific predefined meaning. Last year, those functions were removed from all programs in the SV-Benchmarks collection. More about the reasoning is explained in the SV-COMP 2021 competition report [8].

Category Error-Coverage. The first category is to show the abilities to discover bugs. The benchmark set consists of programs that contain a bug. Every run will be started by a batch script, which produces for every tool and every test-generation task one of the following scores: 1 point, if the validator succeeds in executing the program under test on a generated test case that explores the bug (i.e., the specified function was called), and 0 points, otherwise.

<sup>3</sup> https://github.com/sosy-lab/sv-benchmarks

<sup>4</sup> https://github.com/sosy-lab/sv-benchmarks/pull/774

<sup>5</sup> https://test-comp.sosy-lab.org/2021/rules.php

<sup>6</sup> https://test-comp.sosy-lab.org/2021/benchmarks.php

Fig. 3: Category structure for Test-Comp 2021; compared to Test-Comp 2020, there are three new sub-categories in *Cover-Error* and two new sub-categories in *Cover-Branches*: we added the sub-categories *XCSP*, *BusyBox-MemSafety*, and *DeviceDriversLinux64-ReachSafety* to category *Cover-Error*, and the subcategories *XCSP* and *Combinations* to category *Cover-Branches*

Category Branch-Coverage. The second category is to cover as many branches of the program as possible. The coverage criterion was chosen because many test generators support this standard criterion by default. Other coverage criteria can be reduced to branch coverage by transformation [27]. Every run will be started by a batch script, which produces for every tool and every test-generation task the coverage of branches of the program (as reported by TestCov [15]; a value between 0 and 1) that are executed for the generated test cases. The score is the returned coverage.
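The two scoring rules can be summarized in a few lines of Python. This is only a sketch: in the competition, the verdict and the coverage value come from the test-suite validator, whereas the functions below take them as plain arguments, and the handling of programs without branches is an assumption.

```python
def error_coverage_score(bug_exposed):
    """1 point if a generated test case reaches the specified function call, else 0."""
    return 1 if bug_exposed else 0

def branch_coverage_score(covered_branches, total_branches):
    """Fraction of branches executed by the test suite, a value between 0 and 1."""
    if total_branches == 0:
        return 0.0  # assumption: a program without branches contributes no score
    return covered_branches / total_branches

def category_score(task_scores):
    """Sum of the per-task scores of one test generator in one category."""
    return sum(task_scores)
```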

Ranking. The ranking was decided based on the sum of points (normalized for meta categories). In case of a tie, the ranking was decided based on the run time, which is the total CPU time over all test-generation tasks. Opt-out from categories was possible and scores for categories were normalized based on the number of tasks per category (see competition report of SV-COMP 2013 [3], page 597).
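The normalization for meta categories can be sketched as follows. The formula reflects our reading of the scheme referenced in [3] (each sub-category contributes its score per task, and the sum is scaled by the average number of tasks per sub-category); the function names and the tuple layout are hypothetical.

```python
def normalized_meta_score(categories):
    """categories: list of (score, number_of_tasks) pairs, one per sub-category."""
    if not categories:
        return 0.0
    total_tasks = sum(n for _, n in categories)
    average_tasks = total_tasks / len(categories)
    # Each sub-category contributes its score per task,
    # so large sub-categories do not dominate the meta category.
    return average_tasks * sum(score / n for score, n in categories)

def rank(results):
    """results: list of (tool, score, cpu_time); higher score first, ties by lower CPU time."""
    return sorted(results, key=lambda r: (-r[1], r[2]))
```

For example, a tool with 10 of 10 points in a small sub-category and 50 of 100 points in a large one obtains 55 · (1.0 + 0.5) = 82.5 normalized points rather than the plain sum of 60.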

# 4 Reproducibility

In order to support independent reproduction of the Test-Comp results, we made all major components that are used for the competition available in public version-control repositories. An overview of the components that contribute to the reproducible setup of Test-Comp is provided in Fig. 4, and the details are given in Table 2. We refer to the report of Test-Comp 2019 [6] for a thorough description of all components of the Test-Comp organization and how we ensure that all parts are publicly available for maximal reproducibility.

In order to guarantee long-term availability and immutability of the test-generation tasks, the produced competition results, and the produced test suites, we also packaged the material and published it at Zenodo (see Table 3). The archive for the competition results includes the raw results in BenchExec's XML exchange format, the log output of the test generators and validator, and a mapping from file names to SHA-256 hashes. The hashes of the files are useful for validating the exact contents of a file, and for accessing the files inside the archive that contains the test suites.
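Validating a file against a recorded SHA-256 hash can be done with Python's standard `hashlib` module. The following is a minimal sketch; the function names are hypothetical, and the layout of the mapping file in the archive is not assumed here.

```python
import hashlib

def sha256_of_file(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file without loading it fully into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_recorded_hash(path, recorded_hex):
    """Compare a file against the hash recorded for it in the mapping."""
    return sha256_of_file(path) == recorded_hex.lower()
```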

To provide transparent access to the exact versions of the test generators that were used in the competition, all test-generator archives are stored in a public Git repository. GitLab was used to host the repository for the test-generator archives due to its generous repository size limit of 10 GB.

Competition Workflow. As illustrated in Fig. 4, the ingredients for a test or verification run are (a) a test or verification task (which program and which specification to use), (b) a benchmark definition (which categories and which options to use), (c) a tool-info module (uniform way to access a tool's version string and the command line to invoke), and (d) an archive that contains all executables that are required and cannot be installed as standard Ubuntu package.

(a) Each test or verification task is defined by a task-definition file (as shown, e.g., in Fig. 2). The tasks are stored in the SV-Benchmarks repository and maintained by the verification and testing community, including the competition participants and the competition organizer.

(b) A benchmark definition defines the choices of the participating team, that is, which categories to execute the test generator on and which parameters to pass to the test generator. The benchmark definition also specifies the resource limits of the competition runs (CPU time, memory, CPU cores). The benchmark definitions are created or maintained by the teams and the organizer.

Fig. 4: Benchmarking components of Test-Comp and competition's execution flow (same as for Test-Comp 2020)

Table 2: Publicly available components for reproducing Test-Comp 2021


Table 3: Artifacts published for Test-Comp 2021


(c) A tool-info module is a component that provides a uniform way to access the test-generation or verification tool: it provides interfaces for accessing the version string of a test generator and assembles the command-line from the information given in the benchmark definition and task definition. The tool-info modules are written by the participating teams with the help of the BenchExec maintainer and others.

(d) A test generator is provided as an archive in ZIP format. The archive contains a directory with a README and LICENSE file as well as all components that are necessary for the test generator to be executed. This archive is created by the participating team and merged into the central repository via a merge request.

All above components are reviewed by the competition jury and improved according to the comments from the reviewers by the teams and the organizer.
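To make the role of component (c) more concrete, the sketch below mimics the two responsibilities of a tool-info module: reporting a version string and assembling a command line. It is a hypothetical stand-in and does not use BenchExec's actual base class or method signatures; the tool name, the version-output format, and the `--spec` flag are invented for illustration.

```python
class ExampleToolInfo:
    """Hypothetical stand-in for a tool-info module; not BenchExec's real interface."""

    def name(self):
        return "ExampleTestGenerator"

    def version(self, version_output):
        # Assumption: the tool prints something like "ExampleTestGenerator 1.2.3".
        parts = version_output.strip().split()
        return parts[-1] if parts else ""

    def cmdline(self, executable, options, input_file, property_file):
        # Options come from the benchmark definition,
        # program and specification from the task definition.
        return [executable, *options, "--spec", property_file, input_file]
```

Keeping this logic in one small module per tool lets the benchmarking framework treat all test generators uniformly, independent of their individual command-line conventions.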


Table 4: Competition candidates with tool references and representing jury members

Due to the reproducibility requirements and high level of automation that is necessary for a competition like Test-Comp, participating in the competition is also a challenge in itself: package the tool, provide meaningful log output, specify the benchmark definition, implement a tool-info module, and troubleshoot in case of problems. Test-Comp is a friendly and helpful community, and problems are reported in a GitLab issue tracker, where the organizer and the other teams help fix the problems.

To provide participants access to the actual competition machines, the competition used CoVeriTeam [13] (https://gitlab.com/sosy-lab/software/coveriteam/) for the first time. CoVeriTeam is a tool for cooperative verification, which enables remote execution of test-generation or verification runs directly on the competition machines (among its many other features). This possibility was found to be a valuable service for troubleshooting.

# 5 Results and Discussion

For the third time, the competition experiments represent the state of the art in fully automatic test generation for whole C programs. The report helps in understanding the improvements compared to last year, in terms of effectiveness (test coverage, as accumulated in the score) and efficiency (resource consumption in terms of CPU time). All results mentioned in this article were inspected and approved by the participants.

Participating Test Generators. Table 4 provides an overview of the participating test generators and references to publications, as well as the team representatives of the jury of Test-Comp 2021. (The competition jury consists of the chair and one member of each participating team.) Table 5 lists the features and technologies that are used in the test generators. An online table with information about all participating systems is provided on the competition web site.<sup>7</sup>

<sup>7</sup> https://test-comp.sosy-lab.org/2021/systems.php


Table 5: Technologies and features that the competition candidates used

Computing Resources. The computing environment and the resource limits were the same as for Test-Comp 2020 [5]: Each test run was limited to 8 processing units (cores), 15 GB of memory, and 15 min of CPU time. The test-suite validation was limited to 2 processing units, 7 GB of memory, and 5 min of CPU time. The machines for running the experiments are part of a compute cluster that consists of 168 machines; each test-generation run was executed on an otherwise completely unloaded, dedicated machine, in order to achieve precise measurements. Each machine had one Intel Xeon E3-1230 v5 CPU, with 8 processing units each, a frequency of 3.4 GHz, 33 GB of RAM, and a GNU/Linux operating system (x86\_64-linux, Ubuntu 20.04 with Linux kernel 5.4). We used BenchExec [16] to measure and control computing resources (CPU time, memory, CPU energy) and VerifierCloud <sup>8</sup> to distribute, install, run, and clean-up test-case generation runs, and to collect the results. The values for time and energy are accumulated over all cores of the CPU. To measure the CPU energy, we use CPU Energy Meter [17] (integrated in BenchExec [16]). Further technical parameters of the competition machines are available in the repository which also contains the benchmark definitions. <sup>9</sup>

<sup>8</sup> https://vcloud.sosy-lab.org

Table 6: Quantitative overview over all results; empty cells mark opt-outs; label 'new' indicates first-time participants

One complete test-generation execution of the competition consisted of 34 903 single test-generation runs. The total CPU time was 220 days and the consumed energy 56 kWh for one complete competition run for test generation (without validation). Test-suite validation consisted of 34 903 single test-suite validation runs. The total consumed CPU time was 6.3 days. Each tool was executed several times, in order to make sure no installation issues occur during the execution. Including preruns, the infrastructure managed a total of 210 632 test-generation runs (consuming 1.8 years of CPU time) and 207 459 test-suite validation runs (consuming 27 days of CPU time). We did not measure the CPU energy during preruns.
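The reported run count is consistent with all 11 test generators being executed on all 3 173 tasks. The short calculation below checks this and derives per-run averages; the averages are computed here from the rounded totals and are not figures reported by the competition.

```python
tools, tasks = 11, 3173
runs = tools * tasks                      # every tool was run on every task

cpu_days, energy_kwh = 220, 56            # totals reported for test generation

minutes_per_run = cpu_days * 24 * 60 / runs      # average CPU time per run
watt_hours_per_run = energy_kwh * 1000 / runs    # average CPU energy per run

print(runs, round(minutes_per_run, 1), round(watt_hours_per_run, 1))
```

On average, a run thus used roughly 9 of the 15 available CPU minutes and about 1.6 Wh of CPU energy.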

Quantitative Results. Table 6 presents the quantitative overview of all tools and all categories. The head row mentions the category and the number of test-generation tasks in that category. The tools are listed in alphabetical order; every table row lists the scores of one test generator. We indicate the top three candidates by formatting their scores in bold face and in larger font size. An empty table cell means that the test generator opted out from the respective main category (perhaps participating in subcategories only, restricting the evaluation to a specific topic). More information (including interactive tables, quantile plots for every category, and also the raw data in XML format) is available on the competition web site <sup>10</sup> and in the results artifact (see Table 3). Table 7 reports the top three test generators for each category. The consumed run time (column 'CPU Time') is given in hours and the consumed energy (column 'Energy') is given in kWh.

<sup>9</sup> https://gitlab.com/sosy-lab/test-comp/bench-defs/tree/testcomp21

Table 7: Overview of the top-three test generators for each category (measurement values for CPU time and energy rounded to two significant digits)

Score-Based Quantile Functions for Quality Assessment. We use score-based quantile functions [16] because these visualizations make it easier to understand the results of the comparative evaluation. The web site <sup>10</sup> and the results artifact (Table 3) include such a plot for each category; as an example, we show the plot for category *Overall* (all test-generation tasks) in Fig. 5. All 11 test generators participated in category *Overall*, for which the quantile plot shows the overall performance over all categories (scores for meta categories are normalized [3]). A more detailed discussion of score-based quantile plots for testing is provided in the previous competition report [6].

Alternative Rankings. Table 8 is similar to Table 7, but contains the alternative ranking categories *Green Testing* and *New Test Generators*. Column 'Quality' gives the score in score points (sp), column 'CPU Time' the CPU usage in hours (h), column 'CPU Energy' the CPU usage in kilo-watt-hours (kWh), and column 'Rank Measure' reports the values for the rank measure, which is different for the two alternative ranking categories. (An entry '–' for 'CPU Energy' indicates that we did not measure the energy consumption for technical reasons.)

<sup>10</sup> https://test-comp.sosy-lab.org/2021/results

Fig. 5: Quantile functions for category *Overall*. Each quantile function illustrates the quantile (x-coordinate) of the scores obtained by test-generation runs below a certain number of test-generation tasks (y-coordinate). More details were given previously [6]. The graphs are decorated with symbols to make them better distinguishable without color.

Table 8: Alternative rankings; quality is given in score points (sp), CPU time in hours (h), energy in kilo-watt-hours (kWh), the first rank measure in kilojoule per score point (kJ/sp), and the second rank measure in score points (sp); measurement values are rounded to 2 significant digits


*Green Testing: Low Energy Consumption.* Since a large part of the cost of test generation is caused by the energy consumption, it might be important to also consider the energy efficiency in rankings, as complement to the official Test-Comp ranking. This alternative ranking category uses the energy consumption per score point as rank measure: $\frac{\text{CPU Energy}}{\text{Quality}}$, with the unit kilo-joule per score point (kJ/sp).<sup>11</sup> The energy is measured using CPU Energy Meter [17], which we use as part of BenchExec [16].

Fig. 6: Number of evaluated test generators for each year (top: number of first-time participants; bottom: previous year's participants)
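The rank measure for Green Testing is a unit conversion followed by a division. A minimal sketch, assuming the energy is given in kWh as in the tables (the function name and the example values are hypothetical):

```python
KJ_PER_KWH = 3600  # 1 kWh = 3.6 MJ = 3600 kJ

def green_rank_measure(cpu_energy_kwh, quality_sp):
    """Energy consumed per score point, in kJ/sp; lower is better."""
    if quality_sp <= 0:
        raise ValueError("rank measure is undefined without a positive score")
    return cpu_energy_kwh * KJ_PER_KWH / quality_sp
```

For instance, a tool that consumes 1 kWh and achieves 100 score points would have a rank measure of 36 kJ/sp.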

*New Test Generators.* To acknowledge the test generators that participated for the first time in Test-Comp, the second alternative ranking category lists measures only for the new test generators, and the rank measure is the quality with the unit score point (sp). For example, CMA-ES Fuzz is an early prototype and has already obtained a total score of 411 points in category *Cover-Branches*, and FuSeBMC is a new tool based on some mature components and already took second place in its first participation. This should encourage developers of test generators to participate with new tools of any maturity level.

# 6 Conclusion

Test-Comp 2021 was the 3rd edition of the Competition on Software Testing, and attracted 11 participating teams (see Fig. 6 for the participation numbers and Table 4 for the details). The competition offers an overview of the state of the art in automatic software testing for C programs. The competition not only executes the test generators and collects results, but also validates the achieved coverage of the test suites, based on the latest version of the test-suite validator TestCov. As before, the jury and the organizer made sure that the competition follows the high quality standards of the FASE conference, in particular with respect to the important principles of fairness, community support, and transparency.

Data Availability Statement. The test-generation tasks and results of the competition are published at Zenodo, as described in Table 3. All components and data that are necessary for reproducing the competition are available in public version-control repositories, as specified in Table 2. Furthermore, the results are presented online on the competition web site for easy access: https://test-comp.sosy-lab.org/2021/results/.

<sup>11</sup> Errata: Table 8 of last year's report for Test-Comp 2020 contains a typo: The unit of the energy consumption per score point is kJ/sp (instead of J/sp).

#### References



#### **CoVeriTest with Adaptive Time Scheduling (Competition Contribution)**

Marie-Christine Jakobs<sup>1</sup> and Cedric Richter<sup>2</sup>

<sup>1</sup> Technical University of Darmstadt, Department of Computer Science, Darmstadt, Germany, jakobs@cs.tu-darmstadt.de

<sup>2</sup> Paderborn University, Paderborn, Germany, cedricr@mail.upb.de

**Abstract.** CoVeriTest, which is integrated in the analysis framework CPAchecker, adopts verification technology for test-case generation. It encodes individual test goals as reachability queries, which are then processed by verifiers. To increase the effectiveness on a broad class of testing tasks, CoVeriTest leverages the strengths of two different analyses: an explicit value analysis and predicate abstraction. Similar to TestComp'20, the two analyses are interleaved and the time duration of an interleaving segment is calculated dynamically. However, the calculation of the time duration focuses on the predicted future performance instead of the past performance, thus, rewarding analyses that likely cover open test goals.

**Keywords:** Test-case generation · Cooperative Verification · CPAchecker

# **1 Test-Generation Approach**

Generating test-cases for a diverse set of tasks like in TestComp is challenging and often cannot be performed effectively by a single approach. Therefore, cooperative approaches that combine the strengths of multiple test-case generators frequently show superior performance as long as they do not spend too much time in unproductive test-case generators. To avoid unproductive test-case generation, we equip our CoVeriTest submission with a novel learning-based scheduler that considers the expected productiveness of a test-case generator.

CoVeriTest is a hybrid approach based on the concept of cooperative, verification-based testing [5], which combines complementary verifiers. In our current instantiation, we iteratively run two verification algorithms, namely value analysis [4] and predicate analysis [3]. In each iteration, the analyses continue their exploration until they hit their time limit. The time limit of an analysis is computed dynamically at the beginning of each iteration round using our novel learning-based time scheduler. To generate test cases, we encode the (open) test goals, which are shared between the analyses, as unreachability queries and let the analyses prove the unreachability of those goals. A reported counterexample proves the reachability of a test goal. Therefore, the counterexample is converted into a test [1] and the test goal is removed from the set of open test goals.

This work was funded by the Hessian LOEWE initiative within the Software-Factory 4.0 project and it was partially supported by the German Research Foundation (DFG) within the Collaborative Research Centre "On-The-Fly Computing" (SFB 901) (grant number 160364472).

Jury member

© The Author(s) 2021

E. Guerra and M. Stoelinga (Eds.): FASE 2021, LNCS 12649, pp. 358–362, 2021. https://doi.org/10.1007/978-3-030-71500-7\_18

**Fig. 1.** Our adaptive scheduler integrated in the workflow of CoVeriTest

**Time Scheduling.** Our time scheduler limits the time per iteration round to 100 s<sup>3</sup> and distributes the 100 s based on the expected contribution of the individual analyses. The idea is that an analysis gets more time if there exist more paths to open test goals that the analysis is expected to handle well.

Figure 1 shows the integration of our time scheduler into the CoVeriTest workflow. First, the scheduler samples a set of syntactical counterexample paths $\rho$, each of which starts at the beginning of the program and ends in an open test goal. Then, it estimates for each path $\rho$ the probability $P(V_i \mid \rho)$ that analysis $i$ detects $\rho$ as a real counterexample<sup>4</sup>. We estimate the probability $P(V_i \mid \rho)$ using a unigram language model [9] in combination with the approach of Richter et al. [10] for the abstraction of the syntactical paths $\rho$. Finally, the scheduler assigns a time budget to analysis $i$ in proportion to the average probability of detecting a counterexample path on a testing task $T$ (program plus open test goals):

$$\text{limit}_{i}^{\text{new}} = 10\,\text{s} + 80\,\text{s} \cdot \mathbb{E}_{\rho \in T}\left[P(V_i \mid \rho)\right] \tag{1}$$
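Equation (1) can be illustrated with a few lines of Python. This is a sketch under stated assumptions, not the CoVeriTest implementation: the function name is hypothetical, the per-path probabilities are passed in as plain numbers instead of being estimated by the language model, and returning only the base time for an empty sample is an assumption.

```python
def time_limit(path_probabilities, base=10.0, share=80.0):
    """Time budget in seconds: base time plus a share scaled by the mean probability."""
    if not path_probabilities:
        return base  # assumption: without sampled paths, only the base time is granted
    expected = sum(path_probabilities) / len(path_probabilities)
    return base + share * expected

# Illustrative probabilities for three sampled paths (made-up values).
value_budget = time_limit([0.9, 0.7, 0.8])       # about 74 s
predicate_budget = time_limit([0.1, 0.3, 0.2])   # about 26 s
```

With two analyses, the budgets add up to the 100 s round limit whenever the two expected probabilities sum to 1, as in the illustrative values above; each analysis always receives at least 10 s.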

**Learning Probability Distribution.** The probability distribution $P(V_i \mid \rho)$ is unknown. Thus, we aim to learn the distribution. To this end, we executed the value and predicate analysis separately on the TestComp'20 category coverage-branches and used the reported counterexamples, which by construction can be decided by the reporting analysis, to pre-train our unigram language model [9]. At the beginning of each CoVeriTest execution, we load the pre-trained model and use the reported counterexamples to improve it during execution. When the sampled paths are indecisive, $\mathbb{E}_{\rho \in T}[P(V_i \mid \rho)]$ becomes the normalized progress used in the TestComp'20 strategy [8]. The normalized progress describes the relative contribution of an analysis to the goals covered in the last iteration.

<sup>3</sup> We choose the same iteration time limit as in TestComp'20 [8], which has been established by extensive evaluation of CoVeriTest [5].

<sup>4</sup> Note that it is not important that $\rho$ is a real counterexample. Rather than deciding whether $\rho$ is a counterexample, we model the probability that analysis $i$ is able to decide this question.

# **2 Tool Architecture**

CoVeriTest is an extension of the software analysis framework CPAchecker [2] (version 2.0) and is written in Java. For parsing, we use the Eclipse CDT parser<sup>5</sup>. For test-case generation, we rely on two instances of CPAchecker's test-case generation algorithm, which extracts test cases from counterexamples [1]. One instance generates test cases based on CPAchecker's value analysis [4] and the other instance uses CPAchecker's predicate analysis [3]. Both analyses apply counterexample-guided abstraction refinement [7] and use the SMT solver MathSAT5 [6]. We interleave the two instances and determine their time slices based on their expected success on the set of open test goals. To determine the time slices, we added the adaptive scheduler described in the previous section.

# **3 Strengths and Weaknesses**

The main difference between the CoVeriTest versions in Test-Comp'20 and Test-Comp'21 is the distribution of the 100 s per round. Our own experiments with the Test-Comp 2020 benchmark set revealed a small advantage for our new distribution with respect to the coverage-branches category. Comparing the competition results against a CoVeriTest configuration using the time distribution from Test-Comp'20 shows that the new distribution performs slightly worse in the coverage-error category. In total, 13 errors are missed, 8 of them in the subcategory Floats. Overall, an advantage of the new distribution is scarcely noticeable on the Test-Comp 2021 benchmark set. The unigram language model does not generalize well.

Since the underlying analyses remain the same, CoVeriTest still generates a small number of test cases. Also, the problems with tasks using large arrays and the subcategories BusyBox-Memsafety and SQLite-Memsafety remain. Additionally, CoVeriTest performs poorly on the new ntdrivers tasks and the new subcategory Combinations. While finding the error in the new nla-digbench tasks is difficult, covering branches works well for these tasks. Moreover, CoVeriTest deals well with the new category XCSP and the remaining new tasks.

# **4 Setup**

We developed our extension of CoVeriTest in a fork<sup>6</sup> of CPAchecker and submitted revision 970d550, which participated in all categories. To run CoVeriTest on program program.i, one requires a Java 11 runtime environment and must execute the following command line:

```
scripts/cpa.sh -testcomp21 -setprop log.consoleLevel=SEVERE -stats \
   -benchmark -heap 10000m -spec property.prp program.i
```
Note that property.prp is a placeholder for the test specification (coverage-error-call.prp or coverage-branches.prp). Tests are generated for programs assuming a 32-bit environment. To support 64-bit environments, one needs to add the configuration option -64. The generated tests are written to the folder output/test-suite and adhere to the XML format demanded by the Test-Comp rules. Additionally, the folder contains the mandatory metadata file.

<sup>5</sup> https://www.eclipse.org/cdt/

<sup>6</sup> https://github.com/cedricrupb/cpachecker
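When CoVeriTest is run on many programs, it can help to assemble the command line programmatically. The following Python sketch builds the argument list from the options named in the text. The helper name coveritest_command and the position of -64 among the options are assumptions; only the options themselves come from the text.

```python
def coveritest_command(program, spec, bits=32):
    """Build the CoVeriTest command line as an argument list (hypothetical helper)."""
    if bits not in (32, 64):
        raise ValueError("bits must be 32 or 64")
    cmd = [
        "scripts/cpa.sh", "-testcomp21",
        "-setprop", "log.consoleLevel=SEVERE",
        "-stats", "-benchmark",
        "-heap", "10000m",
    ]
    if bits == 64:
        # Assumption: the text only says that -64 must be added, not where.
        cmd.append("-64")
    cmd += ["-spec", spec, program]
    return cmd


cmd32 = coveritest_command("program.i", "coverage-branches.prp")
cmd64 = coveritest_command("program.i", "coverage-error-call.prp", bits=64)
print(" ".join(cmd32))
```

The resulting list can be passed to subprocess.run from the CPAchecker directory; the generated tests would then be expected in output/test-suite, as described above.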

#### **5 Project and Contributors**

CoVeriTest is an extension of the CPAchecker project<sup>7</sup> and is developed as a joint, open-source project between research groups of Paderborn University and TU Darmstadt. Contributors are Marie-Christine Jakobs and Cedric Richter. We would also like to thank all developers of CPAchecker.

### **References**


<sup>7</sup> https://cpachecker.sosy-lab.org/


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **FuSeBMC: A White-Box Fuzzer for Finding Security Vulnerabilities in C Programs (Competition Contribution)**

Kaled M. Alshmrany(-)<sup>1</sup>, Rafael S. Menezes<sup>2</sup>, Mikhail R. Gadelha<sup>3</sup>, and Lucas C. Cordeiro<sup>4</sup>

> University of Manchester, Manchester, UK
> Institute of Public Administration, Jeddah, Saudi Arabia
> kaled.alshmrany@manchester.ac.uk
> Federal University of Amazonas, Manaus, Brazil
> SIDIA Instituto de Ciência e Tecnologia, Manaus, Brazil
> University of Manchester, Manchester, UK

**Abstract.** We describe and evaluate a novel white-box fuzzer for C programs named FuSeBMC, which combines fuzzing and symbolic execution, and applies Bounded Model Checking (BMC) to find security vulnerabilities in C programs. FuSeBMC explores and analyzes C programs (1) to find execution paths that lead to property violations and (2) to incrementally inject labels to guide the fuzzer and the BMC engine to produce test-cases for code coverage. FuSeBMC successfully participates in Test-Comp'21 and achieves first place in the Cover-Error category and second place in the Overall category.

**Keywords:** Automated Test-Case Generation · Symbolic Execution · Bounded Model Checking · Fuzzing · Security.

# **1 Test Generation Approach**

Automated test-case generation is a method to check whether the software matches expected requirements [2]. It involves the automated execution of software components to evaluate intricate properties and achieve code coverage metrics (e.g., decision, branch, instruction). Here, we describe and evaluate a novel white-box fuzzer, FuSeBMC, capable of automatically producing test-cases for C programs. FuSeBMC provides an innovative software testing framework that detects security vulnerabilities in C programs by using fuzzing and symbolic execution in combination with Bounded Model Checking (BMC) (cf. Fig. 1). FuSeBMC builds on top of clang [1] to instrument the C program, uses Map2check [8] as a fuzzing engine, and ESBMC (Efficient SMT-based Bounded Model Checker) [4,5] as BMC and symbolic execution engines, thus combining dynamic and static verification techniques.

<sup>-</sup>Jury Member

<sup>©</sup> The Author(s) 2021

E. Guerra and M. Stoelinga (Eds.): FASE 2021, LNCS 12649, pp. 363–367, 2021. https://doi.org/10.1007/978-3-030-71500-7_19

Fig. 1: FuSeBMC: a white-box fuzzer framework for C Programs.

FuSeBMC takes a C program and a test specification [3] as input. In the Cover-Error category, FuSeBMC invokes the fuzzing and BMC engines sequentially to find a path that violates a given property. It uses an iterative BMC approach that incrementally unwinds the program until it finds a property violation or exhausts time or memory limits. FuSeBMC uses incremental BMC to explore the program state space searching for a property violation since all programs in Test-Comp'21 are known to have issues. In the Cover-Branches category, FuSeBMC explores and analyzes the target C program using the clang compiler to inject labels incrementally. FuSeBMC will compute all branches of the C code and inject the labels for each branch by adding the label GOAL-N, where N is the goal number. Both engines will check whether these injected labels are reachable to produce test-cases for branch coverage.
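The iterative BMC approach described above can be pictured as a loop that raises the unwinding bound until a violation is found or a resource limit is hit. The following Python sketch is a minimal illustration under stated assumptions: check_bound stands in for a real BMC query, toy_check is a made-up program whose assertion fails in loop iteration 12, and the parameter names are hypothetical. It is not ESBMC's implementation.

```python
import time


def incremental_bmc(check_bound, k_step=1, max_k=None, time_limit_s=1.0):
    """Raise the unwinding bound k until a violation is found or a limit is hit.

    check_bound(k) stands in for one BMC query with bound k and returns a
    counterexample, or None if no violation exists within k unwindings.
    """
    start = time.monotonic()
    k = k_step
    while max_k is None or k <= max_k:
        if time.monotonic() - start > time_limit_s:
            return None, k  # time budget exhausted
        counterexample = check_bound(k)
        if counterexample is not None:
            return counterexample, k
        k += k_step
    return None, k  # bound limit exhausted


def toy_check(k):
    """Toy program: a loop whose assertion fails in iteration 12."""
    for i in range(1, k + 1):
        if i == 12:
            return {"failing_iteration": i}
    return None


found, bound = incremental_bmc(toy_check, k_step=5)
print(found, bound)
```

With a step of 5, the bounds 5 and 10 yield no violation, and the violation is reported at bound 15. A larger step saves queries but makes each query more expensive.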

FuSeBMC analyzes the counterexamples and saves them as a *graphml* file. It checks whether the fuzzing and BMC engines could produce counterexamples for both categories Cover-Error and Cover-Branches. If that is not the case, FuSeBMC employs a second fuzzing engine named selective fuzzer which produces test-cases for the rest of the labels. The selective fuzzer produces test-cases by learning from the two engines' output: it analyzes the range of the inputs that should be passed to examine the target C program and then produces different test-cases. Lastly, FuSeBMC prepares valid test-cases with metadata to test a target C program using TestCov [3] as a test validator.

FuSeBMC sets a 150-second limit for the fuzzing engine, a 700-second limit for the BMC engine, and a 50-second limit for the selective fuzzer. These numbers were obtained empirically by analyzing the Test-Comp'21 results.
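The interplay of the three engines and their time budgets can be sketched as a sequential pipeline in which each engine works on the goals that are still uncovered. The Python code below is only an illustration: the engine stand-ins, the goal sets they cover, and the decision to hand each engine only the remaining goals are assumptions, while the three budgets are the ones stated above.

```python
# Time budgets in seconds, as stated in the text.
BUDGETS_S = {"fuzzer": 150, "bmc": 700, "selective_fuzzer": 50}


def run_pipeline(goals, engines, budgets=BUDGETS_S):
    """Run the engines in order; each one sees only the goals still uncovered."""
    remaining = set(goals)
    covered_by = {}
    for name, engine in engines:
        if not remaining:
            break
        covered = set(engine(frozenset(remaining), budgets[name])) & remaining
        for goal in covered:
            covered_by[goal] = name
        remaining -= covered
    return covered_by, remaining


# Hypothetical stand-ins for the three engines; real engines would honor budget_s.
def toy_fuzzer(goals, budget_s):
    return {g for g in goals if g == "GOAL-1"}


def toy_bmc(goals, budget_s):
    return {g for g in goals if g in {"GOAL-1", "GOAL-2", "GOAL-3"}}


def toy_selective_fuzzer(goals, budget_s):
    return {g for g in goals if g == "GOAL-4"}


goals = [f"GOAL-{n}" for n in range(1, 6)]
covered_by, uncovered = run_pipeline(
    goals,
    [("fuzzer", toy_fuzzer), ("bmc", toy_bmc), ("selective_fuzzer", toy_selective_fuzzer)],
)
print(covered_by, uncovered)
```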

#### **2 Strengths and Weaknesses**

Incremental BMC allows FuSeBMC to keep unwinding the program until a property violation is found or time or memory limits are exhausted. This approach is advantageous in the Cover-Error category, as finding one error is the primary goal. Another strength of FuSeBMC is that it can accurately model C programs that use IEEE floating-point arithmetic [6,7]. The floating-point encoding layer in our BMC engine extends the support for the SMT FP theory to solvers that do not support it natively. FuSeBMC can test programs with floating-point arithmetic using all solvers currently supported by the BMC engine (ESBMC), including Boolector [9], which does not support the SMT FP theory natively.

In both Cover-Error and Cover-Branches categories, various test-cases produced by FuSeBMC are validated successfully. The majority of our test-cases were produced by the BMC engine and the selective fuzzer; our fuzzing engine did not produce many test-cases because it does not model the C library, so it mostly guesses the inputs. For example, in the Cover-Error category, TestCov confirms 500 test-cases produced by FuSeBMC, where our fuzzing engine produces 13 (Map2Check), BMC engine produces 393 (ESBMC), while our selective fuzzer produces 94 test-cases (selective).

However, note that our fuzzing engine is not limited to producing test-cases. It helps our selective fuzzer by providing information about the number of inputs required to trigger a property violation, i.e., the number of assignments required to reach an error. In several cases, the BMC engine can exhaust the time limit before providing such information, e.g., when there are large arrays that need to be initialized at the beginning of the program. For example, consider the following code fragment extracted from the standard_copy1_ground-2.c benchmark, as illustrated in Fig. 2.

```
#define N 100000
...
int a, a1[N], a2[N];
/* Both arrays are filled with nondeterministic inputs. */
for (a = 0; a < N; a++) {
  a1[a] = __VERIFIER_nondet_int();
  a2[a] = __VERIFIER_nondet_int();
}
...
/* The assertion is only checked after all N elements are initialized. */
for (int x = 0; x < N; x++)
  __VERIFIER_assert(a1[x] == a2[x]);
```
Fig. 2: Code fragment that contains a large array.

In this particular example, ESBMC exhausts the time limit before checking the assertion a1[x] == a2[x]. Apart from that, our verification engines are also limited in producing test-cases because of the many optimizations we perform when converting the program to SMT. In particular, two techniques affected the test-case generation significantly: *constant folding* and *slicing*. *Constant folding* evaluates constants (which include nondeterministic symbols) and propagates them throughout the formula during encoding, and *slicing* removes expressions that are not on the path that triggers a property violation. These two techniques can significantly reduce SMT solving time. However, they can remove expressions that are required to trigger a violation when the program is compiled, i.e., variable initialization might be optimized away, forcing FuSeBMC to generate a test-case with undefined behavior.

Regarding our fuzzing engine, we identified a limitation in handling programs with pointer dereferences. The fuzzing engine keeps track of variables throughout the program but has issues identifying when they go out of scope. When we try to generate a test-case that triggers a pointer dereference, our fuzzing engine provides garbage values, and the selective fuzzer might create test-cases that do not reach the error.

# **3 Tool Setup and Configuration**

In order to run our fusebmc.py script,<sup>5</sup> one must set the architecture (*i.e.*, 32 or 64-bit), the competition strategy (i.e., *k*-induction, falsification, or incremental BMC), the property file path, and the benchmark path, as:

```
fusebmc.py [-a {32, 64}] [-p PROPERTY_FILE]
                 [-s {kinduction,falsi,incr,fixed}]
                 [BENCHMARK_PATH]
```
where -a sets the architecture, -p sets the property file path, and -s sets the strategy (one of kinduction, falsi, incr, or fixed). For Test-Comp'21, FuSeBMC uses incr for incremental BMC.
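The usage synopsis above maps directly onto a command-line parser. The following Python sketch mirrors it with argparse. It is a hypothetical reconstruction for illustration, not the code of the actual fusebmc.py script; the destination names and the absence of defaults are assumptions.

```python
import argparse


def build_parser():
    """Parser mirroring the usage synopsis in the text (hypothetical reconstruction)."""
    parser = argparse.ArgumentParser(prog="fusebmc.py")
    parser.add_argument("-a", dest="arch", type=int, choices=[32, 64],
                        help="architecture (32 or 64-bit)")
    parser.add_argument("-p", dest="property_file",
                        help="path to the property file")
    parser.add_argument("-s", dest="strategy",
                        choices=["kinduction", "falsi", "incr", "fixed"],
                        help="competition strategy")
    parser.add_argument("benchmark_path", nargs="?",
                        help="path to the C program under test")
    return parser


# Example invocation with the strategy used in Test-Comp'21 (file names are made up).
args = build_parser().parse_args(
    ["-a", "32", "-p", "coverage-branches.prp", "-s", "incr", "program.c"]
)
print(args)
```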

When choosing the fuzzing engine, we set the following options when executing Map2Check: a timeout of 150 seconds for Map2Check in Cover-Error and a timeout of 70 seconds in Cover-Branches; --fuzzer-mb 1000 limits memory to 1000 MB; --target-function-name reach_error defines the function name to be searched; --target-function checks whether the target function is reachable; --nondet-generator fuzzer uses only fuzzing; --generate-witness sets the witness output path.

By choosing incremental BMC, the following options are set when executing ESBMC: --no-div-by-zero-check disables the division by zero check (required by Test-Comp); --force-malloc-success assumes that all dynamic allocations succeed (a Test-Comp requirement); --floatbv enables floating-point SMT encoding; --incremental-bmc enables incremental BMC; --unlimited-k-steps removes the upper limit of iteration steps for incremental BMC; --witness-output sets the witness output path; --no-bounds-check and --no-pointer-check disable bounds-check and pointer-safety checks, resp., since we are only interested in finding reachability bugs; --k-step 5 sets the increment of the unwinding bound in incremental BMC to 5; --no-allign-check disables pointer alignment checks; and --no-slice disables slicing of unnecessary instructions.

The BenchExec tool-info module is named fusebmc.py and the benchmark definition file is FuSeBMC.xml.

<sup>5</sup> https://gitlab.com/sosy-lab/test-comp/archives-2021/-/blob/master/2021/FuSeBMC.zip

#### **4 Software Project**

The FuSeBMC source code is written in C++ and it is available for downloading at GitHub,<sup>6</sup> which includes the latest release of FuSeBMC v3.6.6. FuSeBMC is publicly available under the terms of the MIT License. Instructions for building FuSeBMC from the source code are given in the file README.md (including the description of all dependencies).

#### **References**



<sup>6</sup> https://github.com/kaled-alshmrany/FuSeBMC

# **Symbiotic 8: Parallel and Targeted Test Generation (Competition Contribution)**

Marek Chalupa, Jakub Novák, and Jan Strejček

#### Masaryk University, Brno, Czech Republic


**Abstract.** The setup of Symbiotic 8 for Test-Comp 2021 brings radical changes in the test generation for the coverage-branches property. As in Symbiotic 7, we generate tests by running our fork of the symbolic executor Klee on the analyzed program. Symbiotic 8, however, runs several instances of Klee in parallel. We run one instance of Klee on the original program and, simultaneously, we create one (intentionally unsound) program slice for every program-terminating instruction in the program and run Klee on these slices. Apart from this principal change, we also improved other components of the tool, mainly the program slicer. Further, our fork of Klee now supports symbolic pointer arithmetic and comparison of symbolic addresses.

### **1 Test-Generation Approach**

Symbiotic [3,2] is an open-source program analysis framework that combines static analyses with code transformations in order to enable faster analysis of the code. In the setup for Test-Comp 2021, Symbiotic uses *program slicing* [6] in combination with *symbolic execution* [5].

Static (backward) program slicing [6] is a technique that removes program instructions that have no influence on reachability or the effect of selected parts of the program. In Test-Comp, we use program slicing for all properties. For the coverage-error-call property, we slice the program to remove instructions that cannot affect reachability of the error location. For the coverage-branches property, we use program slicing to create modified versions of the program on which we are likely to quickly generate tests that reach hard-to-cover parts of the program.

Symbolic execution [5] is a program analysis technique that enumerates all possible execution paths of a program. For every path, it computes its corresponding *path condition*, which is a collection of constraints on program inputs that forms the necessary and sufficient condition to follow the path. Each path condition is then used to create a test that makes the program execute the given path.
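The relation between paths, path conditions, and tests can be illustrated with a small Python sketch. It enumerates the paths of a hand-encoded toy program with a single integer input y and searches a small input range for a value that satisfies each path condition. The program encoding, the function names, and the brute-force search that stands in for an SMT solver are all assumptions for illustration; a symbolic executor such as Klee works on real code and uses a constraint solver.

```python
def explore(node, path_condition, paths):
    """Depth-first enumeration of all syntactic paths of a branching program.

    A node is ("branch", predicate, then_node, else_node) or ("end", label).
    """
    if node[0] == "end":
        paths.append((node[1], list(path_condition)))
        return
    _, pred, then_node, else_node = node
    explore(then_node, path_condition + [pred], paths)
    explore(else_node, path_condition + [lambda y, p=pred: not p(y)], paths)


def solve(path_condition, domain=range(-3, 4)):
    """Stand-in for an SMT solver: search a small input domain for a model."""
    for y in domain:
        if all(constraint(y) for constraint in path_condition):
            return y
    return None  # no input found in the domain: the path is treated as infeasible


# Toy program over one input y; predicates are written over the input itself.
program = ("branch", lambda y: y < 0,
           ("end", "abort"),
           ("branch", lambda y: y == 0,
            ("branch", lambda y: y + 1 > 0, ("end", "error"), ("end", "return")),
            ("branch", lambda y: y > 0, ("end", "error"), ("end", "return"))))

paths = []
explore(program, [], paths)
tests = [(label, solve(condition)) for label, condition in paths]
print(tests)
```

Each pair in tests names the end of a path and an input that drives the program along it; paths without a satisfying input produce no test.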

#### **1.1 Workflow of Symbiotic 8**

The workflow of Symbiotic 8 in Test-Comp 2021 for the property coverage-error-call is the same as in Symbiotic 7: we slice the analyzed program with respect to calls of the error function and run Klee on this sliced program. If Klee finds a feasible path that calls the error function, we attempt to replay this path in the unsliced program to fill in the possibly missing values returned from calls to functions \_\_VERIFIER\_nondet\_\* that may have been sliced away.

The workflow for the property coverage-branches changed significantly in Symbiotic 8. For this property, we run several instances of Klee in parallel: one instance on the original program and other instances on slices generated for every terminating location in the program.

More precisely, we create a pool of processes that keeps at most 8 processes running at the same time (on a first-come, first-served basis). We start an instance of Klee on the original program and add it to the pool. Then we identify instructions in the program that terminate the execution (further referred to as *targets*). For each target, we create a slice and queue a run of Klee on this slice.
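The scheduling described above can be sketched with a bounded worker pool from the Python standard library. In the code below, run_klee is a placeholder that only fabricates a test name, and the naming of slices is made up; a real setup would launch Klee as a subprocess. The limit of 8 concurrent runs and the submission order (main instance first, then one run per target) follow the text.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_PARALLEL = 8  # at most 8 runs at the same time, as stated in the text


def run_klee(program):
    """Placeholder for one Klee run; returns the names of the generated tests."""
    return [f"{program}.test"]


def generate_tests(original, targets):
    """Queue the main run first, then one run per target slice (first come, first served)."""
    with ThreadPoolExecutor(max_workers=MAX_PARALLEL) as pool:
        futures = {"main": pool.submit(run_klee, original)}
        for target in targets:
            futures[target] = pool.submit(run_klee, f"slice[{target}]")
        return {name: future.result() for name, future in futures.items()}


results = generate_tests("program.bc", ["error", "abort", "return"])
print(results)
```

Threads are sufficient here because each worker would merely wait for an external process; the executor's internal queue provides the first-come, first-served order.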

These slices are *unsound* in the sense that they do not preserve all execution paths to the targets. A slice is constructed in two steps:


For example, consider the code on the left in Figure 1. It contains three possible targets, namely error() (line 7), abort() (line 13), and return 0 (line 17). If we slice with respect to the target error(), we search the program backwards from this target and collect all instructions in the body of function foo. Then we return to the call site on line 16 and collect all instructions of function main except the call to abort (from which the call to foo is unreachable). All instructions except the gathered ones are replaced with a call to abort. Standard program slicing then produces the program depicted in the middle of Figure 1 (in this case, it just removes the return). The slice for the target abort() preserves only the first three lines of main, as depicted on the right in Figure 1.
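The backward search in this example can be approximated by a reachability computation on a control-flow graph. The following Python sketch covers only that part: it works on a flat, hand-written graph in which calls are treated as single instructions, so it ignores the interprocedural handling and the subsequent standard slicing. The graph encoding and the function names are assumptions for illustration.

```python
from collections import deque


def backward_reachable(cfg, target):
    """Return all instructions from which the target is reachable."""
    predecessors = {node: [] for node in cfg}
    for node, successors in cfg.items():
        for succ in successors:
            predecessors[succ].append(node)
    seen = {target}
    queue = deque([target])
    while queue:
        node = queue.popleft()
        for pred in predecessors[node]:
            if pred not in seen:
                seen.add(pred)
                queue.append(pred)
    return seen


def replace_unreachable(cfg, target):
    """Map each instruction to itself, or to abort() if it cannot reach the target."""
    keep = backward_reachable(cfg, target)
    return {node: (node if node in keep else "abort()") for node in cfg}


# Hand-written flat graph: instruction -> list of successor instructions.
cfg = {
    "y = nondet()": ["if (y < 0)"],
    "if (y < 0)": ["abort()", "if (y == 0)"],
    "abort()": [],
    "if (y == 0)": ["y = inc(y)", "foo(y)"],
    "y = inc(y)": ["foo(y)"],
    "foo(y)": ["if (x > 0)"],
    "if (x > 0)": ["error()", "return 0"],
    "error()": [],
    "return 0": [],
}

sliced = replace_unreachable(cfg, "error()")
print(sliced)
```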

Whenever the *main* instance of Klee finishes test generation, we have tests for all feasible execution paths of the program. Therefore, we kill all other running instances of Klee and discard tests that were not generated by the main instance to reduce the size of the test suite. If the main instance does not finish before the timeout, we keep all generated tests.
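The rule for assembling the final test suite is simple enough to state as a small function. The sketch below assumes that the tests are available as a dictionary keyed by instance name, with the key "main" for the run on the original program; this data layout and the test names are hypothetical.

```python
def select_tests(tests_by_instance, main_finished):
    """Keep only the main instance's tests if it finished, otherwise keep all tests."""
    if main_finished:
        return list(tests_by_instance["main"])
    return [test for tests in tests_by_instance.values() for test in tests]


# Hypothetical test names produced by the main run and by two slice runs.
generated = {
    "main": ["m1", "m2"],
    "slice[error]": ["e1"],
    "slice[abort]": ["a1"],
}
print(select_tests(generated, main_finished=True))
print(select_tests(generated, main_finished=False))
```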

Using the unsound slices aims only to help reach hard-to-cover places in the program. In particular, potentially expensive detours are replaced by abort, and symbolic execution thus does not waste resources on exploring them (see the body of inc in the middle of Figure 1). The current construction of unsound slices guarantees that if a test covers a target in the corresponding slice, then it also covers the same target in the original program. The opposite implication does not hold due to the unsoundness. Note that tests generated from the slices may not, and usually do not, cover all branches in the original program; therefore, we still need to run Klee on the original program.

```
// Original program (left)
 1 int inc(int x) {
 2     return x + 1;
 3 }
 4
 5 void foo(int x) {
 6     if (x > 0)
 7         error();
 8 }
 9
10 int main() {
11     int y = nondet();
12     if (y < 0)
13         abort();
14     if (y == 0)
15         y = inc(y);
16     foo(y);
17     return 0;
18 }

// Unsound slice with respect to error() (middle)
int inc(int x) {
    abort();
}
void foo(int x) {
    if (x > 0)
        error();
}
int main() {
    int y = nondet();
    if (y < 0)
        abort();
    if (y == 0)
        y = inc(y);
    foo(y);
}

// Unsound slice with respect to abort() (right)
int main() {
    int y = nondet();
    if (y < 0)
        abort();
}
```
**Fig. 1.** An example of a program (left) and its unsound slices with respect to the call of error() (middle) and abort() (right).

# **2 Software Architecture**

All parts of Symbiotic 8 use llvm 10 [7]. We compile the analyzed program into llvm bitcode by the compiler Clang.

To carry out symbolic execution, we use our fork of the open-source symbolic executor Klee [1]. The fork has several modifications compared to the mainstream Klee. The main modification is the representation of pointers as segment-offset pairs, which enables symbolic-sized allocations. Since this year, our fork of Klee also supports comparison of and arithmetic on symbolic pointers. We use Z3 [4] as the SMT solver in Klee. The components of Symbiotic are programmed in C++ and the scripts that schedule and control running these components are written in Python.

### **3 Strengths and Weaknesses**

Although symbolic execution is very good at generating test-cases, it suffers from the *path explosion* problem. This problem emerges on programs that contain many branching instructions or loops whose number of iterations depends on the input, and it may hinder symbolic execution from exploring "deep" parts of the program. Using unsound program slices for terminating instructions attempts to alleviate this problem. Although a slice is not guaranteed to preserve paths to the target for which it was created, there are programs where this technique helps symbolic execution to cover substantially more instructions. However, there are also many cases where the technique worsens the coverage.

**Fig. 2.** The coverage achieved by Symbiotic 8 and 7 on individual benchmarks of the *Cover-Branches* category

Figure 2 illustrates the overall positive and negative effect of this approach. The scatter plot on the left compares the coverage achieved by Symbiotic 8 and the coverage achieved by Symbiotic 7 on individual benchmarks that were used in both Test-Comp 2020 and 2021.<sup>1</sup> The scatter plot shows that the behavior of the tool changes dramatically. To summarize the data, we compute the difference between the two coverages on each benchmark (for example, if Symbiotic 8 achieves 80% and Symbiotic 7 achieves 60% coverage, the difference is +20%). The histogram on the right indicates that the overall effect of unsound slices is positive, as the distribution is skewed to positive values. Indeed, Symbiotic 8 took 3rd place in the category *Cover-Branches* (corresponding to the coverage-branches property) in Test-Comp 2021, which is a big improvement over the previous Test-Comp, where Symbiotic was 8th out of 9 participants in this category.
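The per-benchmark difference that underlies the histogram can be computed as in the following Python sketch. The benchmark names and all coverage values except the 80% versus 60% example from the text are made up for illustration, and restricting the comparison to benchmarks present in both data sets mirrors the restriction to benchmarks used in both competitions.

```python
def coverage_differences(coverage_new, coverage_old):
    """Difference in percentage points for benchmarks present in both data sets."""
    shared = coverage_new.keys() & coverage_old.keys()
    return {bench: coverage_new[bench] - coverage_old[bench] for bench in sorted(shared)}


# Coverage in percent; only the 80 vs. 60 pair comes from the text.
symbiotic8 = {"a.c": 80, "b.c": 35, "d.c": 90}
symbiotic7 = {"a.c": 60, "b.c": 50, "c.c": 10}

differences = coverage_differences(symbiotic8, symbiotic7)
improved = sum(1 for delta in differences.values() if delta > 0)
worsened = sum(1 for delta in differences.values() if delta < 0)
print(differences, improved, worsened)
```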

The workflow of Symbiotic on coverage-error-call did not change from last year, and thus the results are similar.

### **4 Tool Setup and Configuration**

The archive is available at https://doi.org/10.5281/zenodo.4491729. Run Symbiotic with the following command:

bin/symbiotic --test-comp --prp <prpfile> [--32] <source>

where --prp sets the verified property and --32 tells Symbiotic to assume a 32-bit architecture (a 64-bit architecture is assumed by default). The generated test-cases are stored in the directory test-suite.

<sup>1</sup> The use of unsound slices is not the only difference between Symbiotic 8 and 7, but we believe that it has the biggest impact on the presented results.

# **5 Software Project and Contributors**

Symbiotic 8, as it competes in Test-Comp 2021, has been developed by Marek Chalupa and Jakub Novák under the supervision of Jan Strejček. The tool and its components are available under the MIT License. llvm, Klee, and Z3 are also available under open-source licenses. The project web page is:

https://github.com/staticafi/symbiotic

# **References**



# Author Index

Aceto, Luca 3; Aguirre, Nazareno 218; Albert, Elvira 24; Alshmrany, Kaled M. 363; Apel, Sven 67; Arroyo, Marcelo 218; Attard, Duncan Paul 3; Bae, Doo-Hwan 292; Barkowsky, Matthias 262; Beyer, Dirk 341; Burger, Erik 87; Chalupa, Marek 368; Chen, Yixiong 46; Cho, Eunho 292; Cordeiro, Lucas C. 363; Dimovski, Aleksandar S. 67; Francalanza, Adrian 3; Frias, Marcelo F. 218; Gadelha, Mikhail R. 363; Garlan, David 130; Giese, Holger 196, 262; Gleitze, Joshua 87; Godio, Ariel 218; Gurov, Dilian 152; Hähnle, Reiner 24; Haltermann, Jan 108; Hammer, Christian 240; Ingólfsdóttir, Anna 3; Jakobs, Marie-Christine 358; Kang, Eunsuk 130; Klare, Heiko 87; Legay, Axel 67; Lei, Zhanyao 46; Li, Nianyu 130; Lidström, Christian 152; Majumdar, Rupak 172; Mathur, Aman 172; Maximova, Maria 196; Menezes, Rafael S. 363; Merayo, Alicia 24; Novák, Jakub 368; Pirron, Marcus 172; Ponzio, Pablo 218; Prakash, Jyoti 240; Qi, Zhengwei 46; Ray, Baishakhi 313; Richter, Cedric 358; Rosner, Nicolás 218; Sakizloglou, Lucas 262; Schneider, Sven 196; Schumi, Richard 269; Shin, Yong-Jun 292; Stegner, Laura 172; Steinhöfel, Dominic 24; Strejček, Jan 368; Sun, Jun 269; Tian, Yuchi 313; Tiwari, Abhishek 240; Wehrheim, Heike 108; Xia, Mingyuan 46; Yang, Yang 46; Zhang, Mingyue 130; Zhong, Ziyuan 313; Zufferey, Damien 172